A LARGE-SCALE SOLAR IMAGE DATASET WITH LABELED EVENT REGIONS
Michael A. Schuh
Rafal A. Angryk
Karthik Ganesan Pillai
Juan M. Banda
Petrus C. Martens
Dept. Computer Science, Montana State University, Bozeman, MT 59717, USA
Dept. Physics, Montana State University, Bozeman, MT 59717, USA
Harvard-Smithsonian Center for Astrophysics, Cambridge, MA 02138, USA
ABSTRACT
This paper introduces a new public benchmark dataset of so-
lar image data from the Solar Dynamics Observatory (SDO)
mission. This is the first release, which contains over 15,000
images and nearly 24,000 solar events, spanning the first six
months of 2012. It combines region-based event labels from
six automated detection modules, ten pre-computed image
parameters for each cell over a grid-based segmentation of
the full resolution images, and a lower resolution version of
the images for further analysis and visualization. Together,
these components serve as a standardized, ready-to-use, solar
image dataset for general image processing research, without
requiring the necessary background knowledge to properly
prepare it. We present here the fundamental dataset creation
details and outline future improvements and opportunities as
data collection continues for the coming years.
Index Terms: computer vision, image processing, data mining, machine learning, dataset benchmark
1. INTRODUCTION
The Solar Dynamics Observatory (SDO) mission has ush-
ered in the era of big data for solar physics. Capturing over
70,000 high-resolution images of the Sun per day, NASA's
SDO mission will produce more data than all previous solar
data archives combined [1]. This overwhelming amount of
data is impossible to analyze manually, image-by-image, as
was common practice in earlier years. Through necessity, automated
analysis is becoming the norm, utilizing algorithms
from computer vision, image processing, machine learning,
and more. As a result, large-scale solar data analysis is fast
emerging as a novel interdisciplinary research area with am-
ple opportunities for a wide variety of research interests.
The dataset presented here¹ combines the metadata of several
automated analysis modules that run continuously in a
dedicated data pipeline. We use these metadata catalogs to
prune the massive data archive to a more usable form for
researchers interested in a hassle-free scientific image dataset
intended for single-image, region-based, event recognition.

¹ Available at http://dmlab.cs.montana.edu/solar/data/
Fig. 1. An example SDO image with HEK labeled event regions.
The first release covers a six-month period from January 1,
2012 to July 1, 2012, and contains over 15,000 images in two
separate (but visually similar) wavebands and almost 24,000
event instances of six different event types. An example im-
age with matching event instances is shown in Figure 1.
Along with public access and free use, a benchmark
dataset such as this can offer value to potential researchers.
Given the complexity of the image data and event labels,
this brand new real-life dataset presents many opportunities
(and challenges) for new knowledge discovery from big data
in solar physics. We also pose several initial questions to
explore with the dataset, motivating research in classifica-
tion and clustering, image similarity indexing and retrieval,
region-based object detection, and frequent pattern discovery.
It is our hope to establish an interesting, reputable, and
freely-available dataset of practical use for the community.
As data continues to be collected and analyzed over the com-
ing years, we plan to release updated and expanded versions
and alternative variations of the growing dataset based on re-
search directions and community feedback.
2. BACKGROUND
Large-scale image datasets are steadily growing in availabil-
ity and variety thanks to modern technology. This can be seen
in all facets of life, from a personal level, with the popular
crowd-sourced Flickr dataset, to a national level, such as satel-
lite imagery, and sometimes even on an international level,
such as the SDO mission upon which our dataset is founded.
Benchmark datasets are crucially important to allow unbi-
ased comparisons of independent research and development
efforts across entire communities. They standardize data that
is often otherwise inexactly reproduced, causing unwanted
variability in published results; the preparation details at best take up valuable
space in publications, and at worst go altogether undocumented.
Increasingly, novel research is encouraged to report results on
popular datasets supported by the community, which
makes it far easier to validate the novelty of results and claims.
Exemplary datasets have hundreds of academic citations,
such as the medical ImageCLEF datasets [2] and the natural
scene PASCAL Visual Object Classes (VOC) datasets [3].
All of these datasets offer unique characteristics regarding
the source and context of the images and labels, and this solar
image dataset is no exception.
Launched on February 11, 2010, the SDO mission is the
first mission of NASA's Living With a Star (LWS) program,
a long term project dedicated to studying aspects of the Sun
that significantly affect human life, with the goal of eventu-
ally developing a scientific understanding sufficient for pre-
diction [4]. The SDO is a 3-axis stabilized spacecraft in geo-
synchronous orbit designed to continuously capture full-disk
images of the Sun [5]. It contains three independent instru-
ments (AIA, HMI, and EVE), but our dataset is currently only
from the Atmospheric Imaging Assembly (AIA), which cap-
tures images in ten separate wavebands across the ultra-violet
and extreme ultra-violet spectrum, selected to highlight spe-
cific elements of solar activity [6].
An international consortium of independent groups,
named the SDO Feature Finding Team (FFT), was selected by
NASA to produce a comprehensive set of automated feature
recognition modules [1]. The SDO FFT modules² operate
through the SDO Event Detection System (EDS) at the Joint
Science Operations Center (JSOC) of Stanford and Lockheed
Martin Solar and Astrophysics Laboratory (LMSAL), as well
as the Harvard-Smithsonian Center for Astrophysics (CfA),
and NASA's Goddard Space Flight Center (GSFC). Some
modules are provided with specialized access to the raw data
pipeline for stream-like data analysis and event detection.
Even though data is made publicly accessible in a timely
fashion, because of the overall size, only a small window of
data is available for on-demand access, while tapes provide
long-term archival storage.
As one of the 16 SDO FFT modules, our interdisciplinary
research group at Montana State University (MSU) is building

² http://solar.physics.montana.edu/sol_phys/fft/
Label  Name               Equation
P1     Entropy            $E = -\sum_{i=0}^{L-1} p(z_i) \log_2 p(z_i)$
P2     Mean               $m = \frac{1}{L} \sum_{i=0}^{L-1} z_i$
P3     Std. Deviation     $\sigma = \sqrt{\frac{1}{L} \sum_{i=0}^{L-1} (z_i - m)^2}$
P4     Fractal Dim.       $D_0 = \lim_{\epsilon \to 0} \frac{\log N(\epsilon)}{\log (1/\epsilon)}$
P5     Skewness           $\mu_3 = \sum_{i=0}^{L-1} (z_i - m)^3 \, p(z_i)$
P6     Kurtosis           $\mu_4 = \sum_{i=0}^{L-1} (z_i - m)^4 \, p(z_i)$
P7     Uniformity         $U = \sum_{i=0}^{L-1} p^2(z_i)$
P8     Rel. Smoothness    $R = 1 - \frac{1}{1 + \sigma^2(z)}$
P9     T. Contrast        *see Tamura [7]
P10    T. Directionality  *see Tamura [7]
Table 1. Image parameters, where L stands for the number of pixels
in the cell, z_i is the i-th pixel value, m is the mean, and p(z_i) is
the grayscale histogram representation of z at i. The fractal dimension
is calculated with the box-counting method, where N(ε) is the number
of boxes of side length ε required to cover the image cell.
a “Trainable Module” for use in the first ever Content-Based
Image Retrieval (CBIR) system for solar images, now online
and publicly available³. We operate at a static six-minute
cadence in the data pipeline on all 10 AIA wavelengths. Each
4096 × 4096 pixel image is segmented by a fixed-size grid,
independent of any dynamic characteristics of the specific
image due to long-term, real-time, stream-processing con-
straints. The 64 × 64 grid creates 4096 cells per image and
our 10 image parameters (listed in Table 1) are calculated for
each cell. This results in roughly 240 images (nearly one million
image cells with 10 parameter values each) per wave per
day across all 10 waves, or over 800,000 images per year.
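The fixed-grid segmentation and several of the Table 1 parameters can be sketched in a few lines of Python. This is a simplified sketch, not the pipeline's actual implementation; P4, P9, and P10 are omitted for brevity, and p(z_i) is taken from the normalized grayscale histogram as in the table's definitions.

```python
import numpy as np

def grid_cells(image, grid=64):
    """Split a full-disk image into a fixed grid of equal cells.

    For a 4096 x 4096 AIA image and a 64 x 64 grid, this yields
    4096 cells of 64 x 64 pixels, independent of image content.
    """
    h, w = image.shape
    ch, cw = h // grid, w // grid
    return image.reshape(grid, ch, grid, cw).swapaxes(1, 2).reshape(-1, ch, cw)

def cell_parameters(cell, levels=256):
    """Compute several Table 1 parameters for one image cell."""
    z = cell.ravel().astype(float)
    L = z.size
    hist, _ = np.histogram(z, bins=levels, range=(0, levels))
    p = hist / L                      # p(z_i), normalized grayscale histogram
    nz = p[p > 0]                     # avoid log2(0) in the entropy sum
    m = z.mean()                      # P2: mean
    sigma = z.std()                   # P3: standard deviation
    g = np.arange(levels)             # gray-level values z_i
    return {
        "entropy": -np.sum(nz * np.log2(nz)),     # P1
        "mean": m,                                 # P2
        "std": sigma,                              # P3
        "skewness": np.sum((g - m) ** 3 * p),      # P5
        "kurtosis": np.sum((g - m) ** 4 * p),      # P6
        "uniformity": np.sum(p ** 2),              # P7
        "smoothness": 1 - 1 / (1 + sigma ** 2),    # P8
    }

cells = grid_cells(np.zeros((4096, 4096)))
assert cells.shape == (4096, 64, 64)   # 64 x 64 grid -> 4096 cells per image
```

At the six-minute cadence this amounts to roughly 240 images per wave per day, i.e., about 240 x 4096, or nearly one million, parameterized cells per wave per day.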
In previous work, we evaluated a variety of possible image
parameters to extract from the solar images. Given the vol-
ume and velocity of the data stream, the best ten parameters
were chosen based on not only their classification accuracy,
but also their processing time [8, 9]. Preliminary event clas-
sification was performed on a limited set of human-labeled
partial-disk images from the TRACE mission [10] to deter-
mine which image parameters best represented the phenom-
ena [11, 12]. A later investigation of solar filament classifica-
tion in H-alpha images from the Big Bear Solar Observatory
(BBSO) showed similar success, even with noisy region la-
bels and a small subset of our ten image parameters [13].
3. THE DATA
Here we discuss in detail the steps taken to create this dataset.
While this is beneficial for reproducibility, our intent is more
to give the researcher confidence in our data
curation methodologies and decisions. This also provides the
reader practical knowledge of working with the several un-
derlying large-scale data repositories.
³ http://cbsir.cs.montana.edu/sdocbir
Fig. 2. Example heatmap plots of an extracted image (in 64 × 64 cells) for each of our ten image parameters.
3.1. Collection
All data comes from SDO FFT modules, either already
available in-house at MSU or through the Heliophysics
Event Knowledgebase (HEK), which is a centralized archive
of metadata accessible online [14]. The HEK is an all-
encompassing, cross-mission metadata repository of solar
event reports and related information. This metadata can be
downloaded manually through the official web interface⁴,
but after finding several limitations towards large-scale event
retrieval, we instead developed our own open-source and publicly
available software application named "Query HEK", or
simply QHEK⁵.
We retrieved only event reports from automated SDO FFT
modules for six types of solar events: active region (AR),
coronal hole (CH), filament (FI), flare (FL), sigmoid (SG),
and sunspot (SS). These specific events were chosen for sev-
eral reasons. From a science perspective, all of these events
are identifiable in static images, without requiring spatial or
temporal features. These events are also, generally speaking,
traditionally well-studied and frequently occurring, at least
often enough that a dedicated module was created for the sole
purpose of detecting such solar phenomena. From a practical
perspective, this meant a larger possible set of reported event
instances from well-known and well-performing automated
modules that never waver or tire, unlike graduate students.
A summary of the event types can be found in Table 2,
which states the number of event instances and reported waveband.
The dataset contains images in the 131 Å and 193 Å wavebands
that match any unique event timestamp. So the 23,517 total
events represent 47,034 labels (each applied twice), but only
17,785 are true labels: those of AR, CH, FL, and SG events in
their reported wave. Note that FL and SG do not include event
outlines, or chain codes (CC), and that FI and SS are reported
from entirely different instrumentation. These events are
included because of their abundance and importance to solar
physics, and for the potential of novel knowledge discovery
from data. Future dataset releases will likely include images
from other wavebands and instruments.

⁴ http://www.lmsal.com/isolsearch
⁵ Available at http://dmlab.cs.montana.edu/qhek
Event  Name           Reported  CC   Instances
AR     Active Region  193 Å     Yes  7108
CH     Coronal Hole   193 Å     Yes  4702
FI     Filament       H-alpha   Yes  4218
FL     Flare          131 Å     No   4316
SG     Sigmoid        131 Å     No   1659
SS     Sunspot        HMI       Yes  1514
Table 2. A summary of the different event types in the dataset,
where CC denotes having a detailed event boundary outline.
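Per-type tallies like those in Table 2 are easy to reproduce from the released files. A minimal sketch, assuming an events.csv with an event-type column holding the two-letter codes (the actual header names in the release may differ):

```python
import csv
from collections import Counter

def count_event_types(path, type_column="event_type"):
    """Tally event instances per type (AR, CH, FI, FL, SG, SS)
    from a CSV whose first line holds the column headers."""
    with open(path, newline="") as f:
        return Counter(row[type_column] for row in csv.DictReader(f))
```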
3.2. Transformation
We first had to standardize attributes across all event types
due to independent reporting styles, such as the wavelength
attribute values. For simplicity, we then discarded event in-
stances reported in other waves, retaining the majority of total
instances, but only using the two most popular waves.
Three spatial attributes define the event location on the
solar disk, with the center point and minimum bounding rect-
angle (MBR) required. These attributes are given as geomet-
ric object strings, encapsulated by the words "POINT()" and
"POLYGON()", where point contains a single (x, y) pair of
pixel coordinates and polygon contains any number of pairs
listed sequentially, e.g., (x_1, y_1, x_2, y_2, ..., x_n, y_n). When the
polygon is used for the MBR attribute, it always contains five
vertices, where the first and last are identical, while a polygo-
nal chain code can contain any number of points. We convert
all spatial attributes from the helioprojective cartesian (HPC)
coordinate system to pixel-based coordinates based on image-
specific metadata [15, 16]. This process removes the need for
any further spatial positioning transformations.
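The geometry strings and the coordinate conversion can be sketched as follows. The parser handles both POINT and POLYGON strings; the conversion is only a first-order FITS-style mapping via a reference pixel (CRPIX) and plate scale (CDELT, arcsec/pixel), assuming the reference value is disk center and ignoring roll angle and projection effects, a simplification of the full treatment in Thompson [15].

```python
import re

def parse_geometry(s):
    """Parse 'POINT(x y)' or 'POLYGON((x1 y1, x2 y2, ...))' into
    a list of (x, y) float pairs."""
    nums = list(map(float, re.findall(r"-?\d+(?:\.\d+)?", s)))
    return list(zip(nums[0::2], nums[1::2]))

def hpc_to_pixel(x_arcsec, y_arcsec, crpix1, crpix2, cdelt1, cdelt2):
    """First-order helioprojective-cartesian -> pixel conversion
    (a sketch only; assumes CRVAL = 0, i.e., Sun-centered pointing)."""
    return (crpix1 + x_arcsec / cdelt1, crpix2 + y_arcsec / cdelt2)
```

For an MBR, the parsed polygon has five vertices with the first and last identical, as described above.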
For each event instance, we find the midpoint of the
event’s duration and round to the nearest minute. We then
record all events that cover each unique time, and find the
nearest image for each wave. This is combined to form a list
of events for each unique image in each wave. We note that
this is a many-to-many relationship, i.e., an image may have
many associated events, and an event may be associated
to many images. This means a single event instance might
derive more than just two labels (one per wave as previously
stated). Also, because we round times to the nearest minute,
we buffer the event durations by ±2 minutes so instantaneous
events are not lost when matching times.
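The matching step above can be sketched with the standard library alone, assuming a sorted list of image timestamps (function names are illustrative, not from the actual pipeline):

```python
from bisect import bisect_left
from datetime import datetime, timedelta

def midpoint_minute(start, end):
    """Midpoint of an event's duration, rounded to the nearest minute."""
    mid = start + (end - start) / 2
    base = mid.replace(second=0, microsecond=0)
    return base + timedelta(minutes=1) if mid.second >= 30 else base

def nearest_image(image_times, t):
    """Closest image timestamp to t, given a sorted list of timestamps."""
    i = bisect_left(image_times, t)
    candidates = image_times[max(i - 1, 0):i + 1] or image_times[-1:]
    return min(candidates, key=lambda ts: abs(ts - t))

def covers(start, end, t, buffer=timedelta(minutes=2)):
    """Does the event, buffered by +/-2 minutes, cover time t?
    The buffer keeps instantaneous events from being lost."""
    return start - buffer <= t <= end + buffer
```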
Fig. 3. Frequency of reports for all six event types over 90 days.
3.3. Formats
The dataset is released as raw text and image files. All events
can be found in the events.csv file with the first line as the
column headers. Thumbnail image and parameter data files
end in _th.png and .txt, respectively. Labels are provided in
the labels.txt file, which contains one line for each image in
the dataset (first value is the image file name), followed by a
list of event IDs matched to the image. The entire dataset is
available online⁶ in a lossless compressed archive. Alternative
formats available include SQL files for database usage
and direct image cell feature vectors (as ARFF files for Weka
[17]). Variations of the dataset will also be available, such as
a class-balanced dataset for standard use by the community.
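The per-image label lists described above can be loaded with a few lines of Python (assuming whitespace-separated values; the actual delimiter in labels.txt may differ):

```python
def read_labels(path):
    """Map each image file name to its matched event IDs.

    labels.txt has one line per image: the image file name first,
    followed by the IDs of all events matched to that image."""
    labels = {}
    with open(path) as f:
        for line in f:
            if line.strip():
                name, *event_ids = line.split()
                labels[name] = event_ids
    return labels
```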
3.4. Uses
This dataset can facilitate research in a variety of directions
and domains. At MSU, we use these data products to
⁶ http://dmlab.cs.montana.edu/solar/data/
deliver and better develop our solar CBIR system [18]. It is
also useful for benchmarking classification [13] and indexing
effectiveness [19], but plenty of other interesting applications
in image processing, machine learning, and data mining are
possible.
Additionally, several science-related questions can now
be investigated through exploratory analysis as beneficial in-
troductory work with this dataset. For example:
- What combination of image parameters (and algorithms) work
best for which types of events?
- How well can certain types of events be recognized in images
they were not reported in? (e.g., FI and SS)
- Are there any event interdependencies in multi-label regions?
Can this extend to spatial relationships (overlap, envelope,
neighbor, etc.)?
- Quiet Sun analysis: is there really such a thing? Investigating
regions of low/no event activity, much like an event type of its own.
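The question of event interdependencies in multi-label regions can be approached with simple intersection tests over the converted pixel-coordinate MBRs. A sketch with illustrative names, where rectangles are (xmin, ymin, xmax, ymax) tuples:

```python
from collections import Counter
from itertools import combinations

def mbrs_overlap(a, b):
    """Axis-aligned overlap test for two minimum bounding rectangles."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def type_cooccurrences(events):
    """Count overlapping event-type pairs within one image, where
    `events` is a list of (event_type, mbr) tuples."""
    counts = Counter()
    for (ta, ra), (tb, rb) in combinations(events, 2):
        if mbrs_overlap(ra, rb):
            counts[tuple(sorted((ta, tb)))] += 1
    return counts
```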
4. CONCLUSION AND FUTURE WORK
This paper introduced the first version of a new large-scale so-
lar image dataset, featuring full-disk images of the Sun, pre-
computed grid-based image cell signatures, and multi-class
region-based event labels. As an extension of previous work
[13], an upcoming publication using this dataset includes a
detailed statistical analysis of image parameters and events,
as well as basic data mining and machine learning of multi-
event regions. In related works, this dataset is also being ex-
tended for use as an “event tracking” dataset, introducing de-
pendencies on spatial and temporal attributes and exploring
the possibility of mining spatio-temporal co-occurrance pat-
terns (STCOPs) in solar physics [20, 21].
By introducing this ready-to-use dataset to the public, we
hope to interest more researchers from various backgrounds
(computer vision, machine learning, data mining, etc.) in the
domain of solar physics, further bridging the gap between
many interdisciplinary and mutually-beneficial research do-
mains. In the future, we plan to extend the dataset with: (1)
a longer timeframe of up-to-date and labeled data, (2) more
observations from other instruments onboard SDO and else-
where, and (3) more types of events and additional event-
specific attributes for extended analysis of event “sub-type”
characteristics. Further information and news about dataset
updates and uses will be maintained online with the dataset
for timely dissemination. We welcome and encourage the
community to provide feedback about the dataset, including
ideas for alternative formats and future improvements.
5. ACKNOWLEDGEMENTS
This work was supported in part by two NASA Grant Awards:
1) No. NNX09AB03G, and 2) No. NNX11AM13A.
6. REFERENCES
[1] P. C. H. Martens, G. D. R. Attrill, A. R. Davey, A. Engell, S. Farid, P. C. Grigis, et al., "Computer vision for the solar dynamics observatory (SDO)," Solar Physics, Jan 2011.
[2] W. Hersh, H. Müller, and J. Kalpathy-Cramer, "The imageclefmed medical image retrieval task test collection," Journal of Digital Imaging, vol. 22, pp. 648–655, 2009.
[3] M. Everingham, L. Gool, C. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (voc) challenge," International Journal of Computer Vision, vol. 88, pp. 303–338, 2010.
[4] G. L. Withbroe, "Living With a Star," in AAS/Solar Physics Division Meeting #31, vol. 32 of Bulletin of the American Astronomical Society, p. 839, May 2000.
[5] W. Pesnell, B. Thompson, and P. Chamberlin, "The solar dynamics observatory (sdo)," Solar Physics, vol. 275, pp. 3–15, 2012.
[6] J. Lemen, A. Title, D. Akin, P. Boerner, C. Chou, et al., "The Atmospheric Imaging Assembly (AIA) on the Solar Dynamics Observatory (SDO)," Solar Physics, vol. 275, pp. 17–40, 2012.
[7] H. Tamura, S. Mori, and T. Yamawaki, "Texture features corresponding to visual perception," IEEE Transactions on Systems, Man, and Cybernetics, vol. 8, no. 6, pp. 460–472, 1978.
[8] J. M. Banda and R. A. Angryk, "Selection of image parameters as the first step towards creating a CBIR system for the solar dynamics observatory," in International Conference on Digital Image Computing: Techniques and Applications (DICTA), pp. 528–534, 2010.
[9] J. M. Banda and R. A. Angryk, "An experimental evaluation of popular image parameters for monochromatic solar image categorization," in The 23rd Florida Artificial Intelligence Research Society Conference (FLAIRS), pp. 380–385, 2010.
[10] B. Handy, L. Acton, C. Kankelborg, C. Wolfson, D. Akin, et al., "The transition region and coronal explorer," Solar Physics, vol. 187, pp. 229–260, 1999.
[11] J. M. Banda, R. A. Angryk, and P. C. H. Martens, "On the surprisingly accurate transfer of image parameters between medical and solar images," in 18th IEEE Int. Conf. on Image Processing (ICIP), pp. 3669–3672, 2011.
[12] J. M. Banda, R. A. Angryk, and P. C. H. Martens, "Steps toward a large-scale solar image data analysis to differentiate solar phenomena," Solar Physics, pp. 1–28, 2013.
[13] M. A. Schuh, J. M. Banda, P. N. Bernasconi, R. A. Angryk, and P. C. H. Martens, "A comparative evaluation of automated solar filament detection," under review, 2013.
[14] N. Hurlburt, M. Cheung, C. Schrijver, L. Chang, S. Freeland, et al., "Heliophysics event knowledgebase for solar dynamics observatory (SDO) and beyond," Solar Physics, 2010.
[15] W. Thompson, "Coordinate systems for solar image data," Astronomy and Astrophysics, vol. 449, no. 2, pp. 791–803, 2006.
[16] W. D. Pence, "Cfitsio, v2.0: A new full-featured data interface," in Astronomical Data Analysis Software and Systems, (California), 1999.
[17] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten, "The WEKA data mining software: An update," SIGKDD Explorations, 2009.
[18] J. M. Banda, M. A. Schuh, T. Wylie, P. McInerney, and R. A. Angryk, "When too similar is bad: A practical example of the solar dynamics observatory content-based image-retrieval system," in 17th East-European Conf. on Advances in Databases and Information Systems (ADBIS), 2013.
[19] M. A. Schuh, T. Wylie, and R. A. Angryk, "Improving performance of high-dimensional knn retrieval through localized dataspace segmentation and hybrid indexing," in 17th East-European Conf. on Advances in Databases and Information Systems (ADBIS), 2013.
[20] K. G. Pillai, R. A. Angryk, J. M. Banda, M. A. Schuh, and T. Wylie, "Spatio-temporal co-occurrence pattern mining in data sets with evolving regions," in ICDM Workshops, 2012, in press.
[21] K. G. Pillai, R. A. Angryk, J. M. Banda, T. Wylie, and M. A. Schuh, "Spatio-temporal co-occurrence rules," in 17th East-European Conf. on Advances in Databases and Information Systems (ADBIS), 2013.
... One challenge for supervised machine learning methods is that they require sufficient labeled data for the model to learn, but it is challenging to collect datasets describing extreme events that rarely happen. To address this issue, benchmark datasets have been produced, including a solar image benchmarking dataset based on the Solar Dynamics Observatory (SDO; Schuh et al. 2013), surveillance video of 23 types of outdoor activities (Oh et al. 2011), and a change extraction benchmark video dataset with six categories of changes (Goyette et al. 2012). More benchmarking datasets are still in demand to promote a positive and competitive environment for scientific research to produce more effective machine learning methods for event extraction. ...
... Khanna and Cheema 2013;Schuh et al. 2013;Tran- The and Zettsu 2014;Poslad et al. 2015;Dutta et al. 2017) Health monitoring and management Social sensing Rule-based(Kilicoglu and Bergler 2009;Trick 2013;Shimabukuro et al. 2015;Christian et al. 2017) Statistical and probabilistic(Zadeh et al. 2019) Simulation(Duan et al. 2013) Machine learning(Missier et al. 2016) Health sensing Statistical (Varatharajan et al. 2018) Machine learning (Kautz, Groh, and Eskofier 2015; Rodrigues et al. 2015; Groh, Fleckenstein, and Eskofier 2016; Oyana et al. 2017; Soares, Dewalle, and Marsh 2017; Schinasi et al. 2018; Birenboim et al. 2019) Urban intelligence monitoring and management Remote sensing Rule-based (Ho et al. 2012; Ge et al. 2015; Guille and Favre 2015) Statistical and probabilistic (Chierichetti et al. 2014) Machine learning (Khaleghi et al. 2013; Alam et al. 2017; Lo, Chiong, and Cornforth 2017) Social sensing Statistical and probabilistic (Traag et al. 2011) Machine learning (Zhang 2010; Zhuang et al. 2010; Heittola et al. 2011; Atefeh and Khreich 2015; Zhang et al. 2015; Panagiotou et al. 2016; Resch et al. 2016; Costa et al. 2018) In situ sensing Image processing (Wang and Snoussi 2014) Machine learning (Filipponi et al. 2010; Oh et al. 2011; Goyette et al. 2012; Yu, Sun, and Cheng 2012; Jin et al. 2014; Ren et al. 2015; Chang et al. 2016; Cruz-Albarran et al. 2017; Huang et al. 2017; Bok, Kim, and Yoo 2018; Briassouli 2018; Sang, Shi, and Liu 2018; Feller et al. 2019) Health sensing Rule-based (Resch et al. 2015a, 2015b; Osborne and Jones 2017; Werner, Resch, and Loidl 2019) Business intelligence Social sensing Machine learning (Shroff, Agarwal, and Dey 2011; Verma et al. 2015; Nguyen and Grishman 2016; Lanza-Cruz, Berlanga, and Aramburu 2018) Simulation (Ruiz et al. 
2011; Zechman 2011) In situ sensing Machine learning (Estruch and Álvaro 2012; Kezunovic, Xie, and Grijalva 2013; Akila, Govindasamy, and Sandosh 2016; Ren and Wang 2019) Crisis, crime and social unrest monitoring Remote sensing Machine learning (Witmer 2015; Tapete and Cigna 2018) Social sensing Rule-based (Jayarajah et al. 2015) Machine learning (Li, Chen, and Feng 2012; Artikis et al. 2012; Sakai and Tamura 2015; Ristea et al. 2018) In situ sensing Machine learning (Li, Chen, and Feng 2012) Health sensing Rule-based (López-Cuevas et al. 2017) ...
Article
Full-text available
The advancements of sensing technologies, including remote sensing, in situ sensing, social sensing, and health sensing, have tremendously improved our capability to observe and record natural and social phenomena, such as natural disasters, presidential elections, and infectious diseases. The observations have provided an unprecedented opportunity to better understand and respond to the spatiotemporal dynamics of the environment, urban settings, health and disease propagation, business decisions, and crisis and crime. Spatiotemporal event detection serves as a gateway to enable a better understanding by detecting events that represent the abnormal status of relevant phenomena. This paper reviews the literature for different sensing capabilities, spatiotemporal event extraction methods, and categories of applications for the detected events. The novelty of this review is to revisit the definition and requirements of event detection and to layout the overall workflow (from sensing and event extraction methods to the operations and decision-supporting processes based on the extracted events) as an agenda for future event detection research. Guidance is presented on the current challenges to this research agenda, and future directions are discussed for conducting spatiotemporal event detection in the era of big data, advanced sensing, and artificial intelligence.
... As was mentioned in Section I, the ten parameters were originally evaluated for use in solar content based image retrieval of full-disk images in [6] and [20], and were later shown to be useful in identifying similar regions of SDO AIA images in [5] and [7]. These precalculated parameters were made available in [21], with an extended dataset made available for a larger date range in [22], [23], and finally made available online through an API 2 by [24]. ...
Article
Full-text available
This paper focuses on the problem of tracking solar phenomena by creating spatiotemporal trajectories from solar event detection reports. Though tracking of multiple objects in video sequences has seen much research and improvement in recent years, there has been relatively little focus in the domain of tracking solar phenomena (events). In this work, we improve upon our previous endeavors by eliminating offline model training requirements and utilizing crowd-sourced human labels to evaluate our performance. We apply our method to the metadata of two solar event types spanning four years of detection reports from the automated detection modules for the Solar Dynamics Observatory (SDO) mission. We compare our results with those produced by the detection module for active regions and coronal holes by using a crowd-sourced trajectory database as the ground truth. We show that our results are as good or better than the event-specific detection module for these two event types. This is especially promising because our tracking algorithm is a generalized module for all solar events, and not specific to a single event type allowing it to be applied to other solar event types reported to Heliophysics Event Knowledgebase that do not contain tracking information.
... 6.1, this obstacle can be removed and we would be able to attract top researchers doing state-of-the-art work to participate in the efforts of our community. In 2013 there has been a first effort to release datasets [Schuh et al., 2013], but it was not until Kucuk et al. [2017] that a large enough dataset became easily accessible to the computer vision community. ...
Preprint
Full-text available
The authors of this report met on 28-30 March 2018 at the New Jersey Institute of Technology, Newark, New Jersey, for a 3-day workshop that brought together a group of data providers, expert modelers, and computer and data scientists, in the solar discipline. Their objective was to identify challenges in the path towards building an effective framework to achieve transformative advances in the understanding and forecasting of the Sun-Earth system from the upper convection zone of the Sun to the Earth's magnetosphere. The workshop aimed to develop a research roadmap that targets the scientific challenge of coupling observations and modeling with emerging data-science research to extract knowledge from the large volumes of data (observed and simulated) while stimulating computer science with new research applications. The desire among the attendees was to promote future trans-disciplinary collaborations and identify areas of convergence across disciplines. The workshop combined a set of plenary sessions featuring invited introductory talks and workshop progress reports, interleaved with a set of breakout sessions focused on specific topics of interest. Each breakout group generated short documents, listing the challenges identified during their discussions in addition to possible ways of attacking them collectively. These documents were combined into this report-wherein a list of prioritized activities have been collated, shared and endorsed.
... In Figure 2.3, we show how these 10 sets of parameters stack to produce an image data cube for each image. A set of these parameter values was calculated over a subset of the images captured by SDO and made available by [26]. Later, this set was extended to encompass a larger date range in [27][28][29], and finally made available online through an API 3 by [30]. ...
Thesis
Full-text available
Comparing regions of images is a fundamental task in both similarity based object tracking as well as retrieval of images from image datasets, where an exemplar image is used as the query. In this thesis, we focus on the task of creating a method of comparison for images produced by NASA’s Solar Dynamic Observatory mission. This mission has been in operation for several years and produces almost 700 Gigabytes of data per day from the Atmospheric Imaging Assembly instrument alone. This has created a massive repository of high-quality solar images to analyze and categorize. To this end, we are concerned with the creation of image region descriptors that are selective enough to differentiate between highly similar images yet compact enough to be compared in an efficient manner, while also being indexable with current indexing technology. We produce such descriptors by pooling sparse coding vectors produced by spanning learned basis dictionaries. Various pooled vectors are used to describe regions of images in event tracking, entire image descriptors for image comparison in content based image retrieval, and as region descriptors to be used in a content based image retrieval system on the SDO AIA image pipeline.
... The image size is 4096 × 4096 pixels, which amounts to a total of 1.5 terabytes of compressed data per day. An uncompressed and pre-processed version of the data can be obtained from Schuh et al. (2013). Here, each image was partitioned into a 64 × 64 grid of equi-sized sub-images, each consisting of 64 × 64 pixels. ...
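The partitioning quoted above maps cleanly onto an array reshape; this is a hedged sketch with a placeholder image, not the actual pre-processing code:

```python
import numpy as np

# Illustrative sketch of the partitioning described in the citation context:
# a 4096 x 4096 solar image is split into a 64 x 64 grid of equi-sized
# sub-images, each 64 x 64 pixels.
image = np.zeros((4096, 4096), dtype=np.float32)  # placeholder image

cell = 64                    # sub-image edge length in pixels
n = image.shape[0] // cell   # 64 sub-images along each axis

# Reshape so axes 0 and 1 index the grid cell and axes 2 and 3 index
# the pixels within that cell.
sub_images = image.reshape(n, cell, n, cell).swapaxes(1, 2)
print(sub_images.shape)  # (64, 64, 64, 64)
```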
Preprint
It is not unusual for a data analyst to encounter data sets distributed across several computers. This can happen for reasons such as privacy concerns, efficiency of likelihood evaluations, or just the sheer size of the whole data set. This presents new challenges to statisticians, as even computing simple summary statistics such as the median becomes computationally challenging. Furthermore, if other advanced statistical methods are desired, novel computational strategies are needed. In this paper we propose a new approach for distributed analysis of massive data that is suitable for generalized fiducial inference and is based on a careful implementation of a "divide and conquer" strategy combined with importance sampling. The proposed approach requires only a small amount of communication between nodes, and is shown to be asymptotically equivalent to using the whole data set. Unlike most existing methods, the proposed approach produces uncertainty measures (such as confidence intervals) in addition to point estimates for parameters of interest. The proposed approach is also applied to the analysis of a large set of solar images.
Article
It is not unusual for a data analyst to encounter data sets distributed across several computers. This can happen for reasons such as privacy concerns, efficiency of likelihood evaluations, or just the sheer size of the whole data set. This presents new challenges to statisticians, as even computing simple summary statistics such as the median becomes computationally challenging. Furthermore, if other advanced statistical methods are desired, novel computational strategies are needed. In this paper we propose a new approach for distributed analysis of massive data that is suitable for generalized fiducial inference and is based on a careful implementation of a “divide and conquer” strategy combined with importance sampling. The proposed approach requires only a small amount of communication between nodes, and is shown to be asymptotically equivalent to using the whole data set. Unlike most existing methods, the proposed approach produces uncertainty measures (such as confidence intervals) in addition to point estimates for parameters of interest. The proposed approach is also applied to the analysis of a large set of solar images.
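As a rough illustration of the divide-and-conquer idea with small communication cost (a generic pooled-summary combination, not the paper's generalized fiducial procedure or its importance-sampling step), each node can ship only a tiny summary and the combiner can still report a point estimate with an interval:

```python
import math
import random

# Generic divide-and-conquer sketch: each node reports only (mean, variance,
# count); the combiner forms a pooled estimate with a normal-approximation
# confidence interval, so inter-node communication stays tiny.
random.seed(0)
nodes = [[random.gauss(5.0, 2.0) for _ in range(10_000)] for _ in range(4)]

summaries = []
for data in nodes:
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)
    summaries.append((mean, var, n))  # the only values communicated per node

total_n = sum(n for _, _, n in summaries)
pooled_mean = sum(m * n for m, _, n in summaries) / total_n
# Var of the pooled mean: sum_i (n_i * var_i) / N^2, assuming independent nodes.
pooled_se = math.sqrt(sum(v * n for _, v, n in summaries)) / total_n
ci = (pooled_mean - 1.96 * pooled_se, pooled_mean + 1.96 * pooled_se)
print(pooled_mean, ci)
```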
Conference Paper
The ease of use and the capabilities of image editing tools have raised new challenges in the emerging field of digital image forensics related to scanned documents. Unfortunately, the universality of current methods and their applicability in real-world scenarios have not yet been proven due to the absence of a standardized image database. In this paper, we introduce a novel test collection of more than 4500 images annotated with respect to each scanner as a useful tool for forensics investigators to test and compare scanner-based forensic techniques. It is an image database containing documents of various content scanned at more than one resolution with 11 different scanner instances of widely known brands. This selection is based on our latest work on the identification of scanners at the origin of digitized documents and is adapted to fit any source scanner identification technique. The SUPATLANTIQUE database is available for free to the research community and is intended to become a useful reference resource for researchers in this field.
Chapter
Spatiotemporal data mining refers to the extraction of knowledge, regularly repeating relationships, and interesting patterns from data with spatial and temporal aspects. In recent years, many spatiotemporal frequent pattern mining algorithms have been developed for spatiotemporal event instances represented by a series of region objects that evolve over time. These algorithms focus on the discovery of spatiotemporal co-occurrence patterns and event sequences by inspecting spatiotemporal overlap and follow relationships. Before moving on to these relationships, we demonstrate different types of spatiotemporal knowledge to place the relationships and methods in a greater context. This chapter provides a bird's-eye view of the output of spatiotemporal data mining techniques in the literature, gives the rationale for mining spatiotemporal patterns from evolving regions, and explains the challenges of mining patterns from evolving region data.
Article
Full-text available
The National Aeronautics and Space Administration (NASA) Solar Dynamics Observatory (SDO) mission has given us unprecedented insight into the Sun's activity. By capturing approximately 70,000 images a day, this mission has created one of the richest and largest repositories of solar image data available to mankind. With such massive amounts of information, researchers have been able to make great advances in detecting solar events. In this resource, we compile SDO solar data into a single repository in order to provide the computer vision community with a standardized and curated large-scale dataset of several hundred thousand solar events found in high-resolution solar images. This publicly available resource, along with the generation source code, will accelerate computer vision research on NASA's solar image data by reducing the amount of time spent performing data acquisition and curation from the multiple sources we have compiled. By improving the quality of the data with thorough curation, we anticipate wider adoption and interest from both the computer vision and solar physics communities.
Article
Full-text available
The Solar Dynamics Observatory (SDO) was launched on 11 February 2010 at 15:23 UT from Kennedy Space Center aboard an Atlas V 401 (AV-021) launch vehicle. A series of apogee-motor firings lifted SDO from an initial geosynchronous transfer orbit into a circular geosynchronous orbit inclined by 28° about the longitude of the SDO-dedicated ground station in New Mexico. SDO began returning science data on 1 May 2010. SDO is the first space-weather mission in NASA's Living With a Star (LWS) Program. SDO's main goal is to understand, driving toward a predictive capability, those solar variations that influence life on Earth and humanity's technological systems. The SDO science investigations will determine how the Sun's magnetic field is generated and structured, and how this stored magnetic energy is released into the heliosphere and geospace as the solar wind, energetic particles, and variations in the solar irradiance. Insights gained from SDO investigations will also lead to an increased understanding of the role that solar variability plays in changes in Earth's atmospheric chemistry and climate. The SDO mission includes three scientific investigations (the Atmospheric Imaging Assembly (AIA), Extreme Ultraviolet Variability Experiment (EVE), and Helioseismic and Magnetic Imager (HMI)), a spacecraft bus, and a dedicated ground station to handle the telemetry. The Goddard Space Flight Center built and will operate the spacecraft during its planned five-year mission life; this includes commanding the spacecraft, receiving the science data, and forwarding that data to the science teams. The science investigation teams at Stanford University, Lockheed Martin Solar Astrophysics Laboratory (LMSAL), and the University of Colorado Laboratory for Atmospheric and Space Physics (LASP) will process, analyze, distribute, and archive the science data. We describe the building of SDO and the science that it will provide to NASA.
Chapter
Full-text available
Spatiotemporal co-occurrence rule (STCOR) discovery is an important problem in many application domains such as weather monitoring and solar physics, which is our application focus. In this paper, we present a general framework to identify STCORs for continuously evolving spatiotemporal events that have extended spatial representations. We also analyse a set of anti-monotone (monotonically non-increasing) and non-anti-monotone measures for identifying STCORs. We then validate and evaluate our framework on a real-life data set and report a comparison of the number of candidates needed to discover actual patterns, the memory usage, and the number of STCORs discovered using the anti-monotone and non-anti-monotone measures.
Chapter
Full-text available
Measuring interest and relevance has always been one of the main concerns when analyzing the results of a Content-Based Image Retrieval (CBIR) system. In this work, we present a unique problem that the Solar Dynamics Observatory (SDO) CBIR system encounters: too many highly similar images. With over 70,000 images of the Sun produced per day, the problem of finding similar images is transformed into the problem of finding similar solar events based on image similarity. However, the most similar images in our dataset are temporal neighbors capturing the same event instance, so a traditional CBIR system will return highly repetitive images rather than similar but distinct events. In this work, we outline the problem in detail and present several approaches tested to solve this important image data mining and information retrieval issue.
Conference Paper
Full-text available
Spatio-temporal co-occurring patterns represent subsets of event types that occur together in both space and time. In comparison to previous work in this field, we present a general framework to identify spatio-temporal co-occurring patterns for continuously evolving spatio-temporal events that have polygon-like representations. We also propose a set of measures to identify spatio-temporal co-occurring patterns, along with an Apriori-based spatio-temporal co-occurrence mining algorithm to find prevalent patterns for extended spatial representations that evolve over time. We evaluate our framework on real-life data to demonstrate the effectiveness of our measures and the algorithm, and present results highlighting the importance of our measures in identifying spatio-temporal co-occurrence patterns.
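A minimal Apriori-style sketch of the prevalence-based pruning idea follows; the event-type sets and threshold are illustrative stand-ins, not the paper's actual measures or algorithm:

```python
from itertools import combinations

# Each "instance" records which event types co-occur in space and time
# (e.g., AR = active region, FL = flare, SG = sigmoid; all hypothetical here).
# A pattern is kept only if its prevalence meets a threshold; the anti-monotone
# property means only surviving smaller patterns need to be extended.
instances = [
    {"AR", "FL"}, {"AR", "FL", "SG"}, {"AR", "SG"},
    {"FL", "SG"}, {"AR", "FL"},
]
min_prev = 0.4  # minimum fraction of instances containing the pattern

def prevalence(pattern):
    return sum(pattern <= inst for inst in instances) / len(instances)

types = sorted(set().union(*instances))
frequent = [frozenset([t]) for t in types if prevalence({t}) >= min_prev]

# Extend only pairs whose singleton subsets are all frequent (Apriori pruning).
k2 = [frozenset(p) for p in combinations(types, 2)
      if all(frozenset([t]) in frequent for t in p)
      and prevalence(set(p)) >= min_prev]
print(sorted(tuple(sorted(p)) for p in k2))
# [('AR', 'FL'), ('AR', 'SG'), ('FL', 'SG')]
```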
Article
Full-text available
We detail the first application of several dissimilarity measures to large-scale solar image data analysis. Using a solar-domain-specific benchmark dataset that contains multiple types of phenomena, we analyzed combinations of image parameters with different dissimilarity measures to determine the combinations that allow us to differentiate between the multiple solar phenomena from both intra-class and inter-class perspectives, where by class we refer to the same types of solar phenomena. We also investigate the problem of reducing data dimensionality by applying multi-dimensional scaling (MDS) to the dissimilarity matrices produced using the previously mentioned combinations, determining how many MDS components are needed to maintain a good representation of our data (in a new artificial data space) and how many can be discarded to enhance querying performance. Finally, we present a comparative analysis of several classifiers to determine the quality of the dimensionality reduction achieved with this combination of image parameters, similarity measures, and MDS.
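Classical MDS on a precomputed dissimilarity matrix can be sketched as follows; the data here are synthetic stand-ins for the image-parameter descriptors, not the benchmark dataset:

```python
import numpy as np

# Classical MDS sketch: given a dissimilarity matrix D, double-center the
# squared dissimilarities and keep the top-eigenvalue eigenvectors as
# coordinates in a lower-dimensional "artificial" space; components with
# small eigenvalues can be discarded.
rng = np.random.default_rng(0)
points = rng.random((20, 10))  # 20 items described by 10 image parameters
D = np.linalg.norm(points[:, None] - points[None], axis=-1)  # pairwise distances

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
B = -0.5 * J @ (D ** 2) @ J            # double-centered Gram matrix
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1]         # largest eigenvalues first

k = 3                                  # number of retained MDS components
coords = vecs[:, order[:k]] * np.sqrt(np.maximum(vals[order[:k]], 0))
print(coords.shape)  # (20, 3)
```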
Conference Paper
Full-text available
We present a comparative evaluation for automated filament detection in H-alpha solar images. By using metadata produced by the Advanced Automated Filament Detection and Characterization Code (AAFDCC) module, we adapted our Trainable Feature Recognition (TFR) component to accurately detect regions in solar images containing filaments. We first analyze the module's metadata and then transform it into labeled datasets for machine learning classification. Visualizations of data transformations and classification results are presented and accompanied by statistical findings. Our results confirm the reliable event reporting of the AAFDCC module as well as our ability to effectively detect solar filaments with our TFR component.
Article
Full-text available
This volume is dedicated to the Solar Dynamics Observatory (SDO), which was launched 11 February 2010. The articles focus on the spacecraft and its instruments: the Atmospheric Imaging Assembly (AIA), the Extreme Ultraviolet Variability Experiment (EVE), and the Helioseismic and Magnetic Imager (HMI). Articles within also describe calibration results and data processing pipelines that are critical to understanding the data and products, concluding with a description of the successful Education and Public Outreach activities. This book is geared towards anyone interested in using the unprecedented data from SDO, whether for fundamental heliophysics research, space weather modeling and forecasting, or educational purposes. Previously published in Solar Physics journal, Vol. 275/1-2, 2012.
Conference Paper
Efficient data indexing and nearest neighbor retrieval are challenging tasks in high-dimensional spaces. This work builds upon our previous analyses of iDistance partitioning strategies to develop the backbone of a new indexing method using a heuristic-guided hybrid index that further segments congested areas of the dataspace to improve overall performance for exact k-nearest neighbor (kNN) queries. We develop data-driven heuristics to intelligently guide the segmentation of distance-based partitions into spatially disjoint sections that can be quickly and efficiently pruned during retrieval. Extensive tests are performed on k-means derived partitions over datasets of varying dimensionality, size, and cluster compactness. Experiments on both real and synthetic high-dimensional data show that our new index performs significantly better on clustered data than the state-of-the-art iDistance indexing method.
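The core iDistance mapping that this work builds upon can be sketched as follows; this is a simplified illustration of the base scheme only (the paper's contribution is the heuristic-guided segmentation layered on top of it):

```python
import numpy as np

# iDistance-style key sketch: assign each point to its nearest reference
# point, then index it by the 1-D key  partition_id * C + dist_to_reference,
# which maps high-dimensional points onto a B+-tree-friendly single axis.
rng = np.random.default_rng(1)
points = rng.random((1000, 16))   # data points in 16-d unit space
refs = rng.random((8, 16))        # one reference point per partition
C = 10.0                          # constant larger than any in-partition distance

dists = np.linalg.norm(points[:, None] - refs[None], axis=-1)  # (1000, 8)
part = dists.argmin(axis=1)                                    # nearest reference
keys = part * C + dists[np.arange(len(points)), part]          # 1-D index keys
print(keys.shape)  # (1000,)
```

Because `C` exceeds the maximum possible in-partition distance, the key ranges of different partitions never overlap, which is what lets a kNN search prune whole partitions by key range.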
Article
The Transition Region and Coronal Explorer (TRACE), launched 1 April 1998, will have at the time of this meeting been in orbit for just over 8 months. In that time, the instrument will have taken over 500,000 exposures of the sun in ultraviolet and extreme ultraviolet wavelengths, will have completed three-fourths of the nominal mission, and will be approaching the end of the first eclipse season. The TRACE telescope is unique in its ability to observe in UV and EUV wavelengths at high cadence with unprecedented resolution. We present a review of the TRACE instrument and show current observations and results. We discuss the performance of the instrument in terms of observational capabilities, sensitivity, calibration, effects of aging on the instrument, CCD effects, and contamination effects.
Article
NASA has proposed a new initiative, Living With a Star (LWS), a research and development program involving studying solar variability as it affects human technology, humans in space, and terrestrial climate. The goal of the initiative is to develop a capability to observe, understand, and predict the aspects of the connected Sun-Earth system that affect life and society. The initiative includes the following elements: (a) expanded utilization of the Solar Terrestrial Probe missions; (b) establishing a Space Weather Research Network with solar and geospace missions designed to address scientific research problems relevant to the above goal; (c) data analysis/modeling targeted on scientific problems relevant to the goal of the program; (d) Orbiting Environmental Testbeds for testing rad-hard and rad-tolerant systems; and (e) partnering with other agencies and industry.