Finding SDSS BALQSOs Using Non-Negative Matrix Factorisation
ABSTRACT Modern spectroscopic databases provide a wealth of information about the physical processes and environments associated with astrophysical populations. Techniques such as blind source separation (BSS), in which sets of spectra are decomposed into a number of components, offer the prospect of identifying the signatures of the underlying physical emission processes. Principle Component Analysis (PCA) has been applied with some success but is severely limited by the inherent orthogonality restriction that the components must satisfy. Non-negative matrix factorisation (NMF) is a relatively new BSS technique that incorporates a non-negativity constraint on its components. In this respect, the resulting components may more closely reflect the physical emission signatures than is the case using PCA. We discuss some of the considerations that must be made when applying NMF and, through its application to the quasar spectra in the Sloan Digital Sky Survey (SDSS) DR6, we show that NMF is a fast method for generating compact and accurate reconstructions of the spectra. The ability to reconstruct spectra accurately has numerous astrophysical applications. Combined with improved SDSS redshifts, we apply NMF to the problem of defining robust continua for quasars that exhibit strong broad absorption line (BAL) systems. The resulting catalogue of SDSS DR6 BAL quasars will be the largest available. Importantly, the NMF approach allows quantitative error estimates to be derived for the Balnicity Indices as a function of key astrophysical and observational parameters, such as the quasar redshifts and the signal-to-noise ratio of the spectra. Comment: 7 pages, 2 figures, to appear in the proceedings of "Classification and Discovery in Large Astronomical Surveys", Ringberg Castle, 14-17 October 2008
arXiv:0810.4231v1 [astro-ph] 23 Oct 2008
Finding SDSS BALQSOs Using Non-Negative
James T. Allen∗, Paul C. Hewett∗, Vasily Belokurov∗and Vivienne Wild†
∗Institute of Astronomy, University of Cambridge, Madingley Road, Cambridge CB3 0HA
†Max-Planck-Institut für Astrophysik, Karl-Schwarzschild Str. 1, 85741 Garching, Germany
Abstract. Modern spectroscopic databases provide a wealth of information about the physical
processes and environments associated with astrophysical populations. Techniques such as blind
source separation (BSS), in which sets of spectra are decomposed into a number of components,
offer the prospect of identifying the signatures of the underlying physical emission processes.
Principle Component Analysis (PCA) has been applied with some success but is severely limited
by the inherent orthogonality restriction that the components must satisfy.
Non-negative matrix factorisation (NMF) is a relatively new BSS technique that incorporates a
non-negativity constraint on its components. In this respect, the resulting components may more
closely reflect the physical emission signatures than is the case using PCA. We discuss some of the
considerations that must be made when applying NMF and, through its application to the quasar
spectra in the Sloan Digital Sky Survey (SDSS) DR6, we show that NMF is a fast method for
generating compact and accurate reconstructions of the spectra.
The ability to reconstruct spectra accurately has numerous astrophysical applications. Combined
with improved SDSS redshifts, we apply NMF to the problem of defining robust continua for
quasars that exhibit strong broad absorption line (BAL) systems. The resulting catalogue of SDSS
DR6 BAL quasars will be the largest available. Importantly, the NMF approach allows quantitative
error estimates to be derived for the Balnicity Indices as a function of key astrophysical and
observational parameters, such as the quasar redshifts and the signal-to-noise ratio of the spectra.
Keywords: methods: statistical, quasars: absorption lines
PACS: 95.75.Pq, 98.54.Aj, 98.62.Ra
The scale of modern spectroscopicsurveys such as theSloan Digital Sky Survey (SDSS)
demand the development and application of new analysis techniques that are able to
condense the vast quantities of available data to extract useful results. Blind source
separation (BSS) techniques are a family of techniques that enable such a condensation
of information. A BSS technique can take a set of spectra written as a matrix, V, and
generateacorrespondingset ofcomponents,H, thatcan belinearlycombinedtorecreate
the original data using a set of weighting coefficients,W:
The form of the components depends on the particular technique used.
The most widely used BSS technique in astronomy is principal component analysis
(PCA), which generates orthogonal components. PCA has been successfully applied for
a number of purposes but interpretation of the component spectra is limited by their
orthogonality. In this work we describe non-negative matrix factorisation, a recently-
developed BSS technique, and demonstrate its application to the reconstruction of ab-
sorbed quasar continua in the SDSS.
NON-NEGATIVE MATRIX FACTORISATION
Non-negative matrix factorisation (NMF) is a relatively new BSS technique that
incorporates a non-negativity constraint on both its components and their weights [1, 2,
3]. The non-negativity constraint is appealing in the context of spectroscopic data as the
physical emission signatures are expected to naturally obey this restriction. Unusually
for a BSS technique, fewer components are generated than there are input spectra.
Starting from random initial matrices, the components (H) and weights (W) follow the
multiplicative update rules
As fewer components are generated than there are input spectra the reconstructions
WH will in general be approximations to the data; the update rules minimisethe error in
the approximations, as measured by the Euclidean distance between the reconstructions
and the data. The random starting conditions result in slightly different components
being generated each time the algorithm is executed, but the resulting reconstructions do
When components and weights have been generated from one set of input spectra,
the components can be applied to generate reconstructions of other spectra. In this case
random initial weights are used, which are updated according to equation 2, while the
components are held fixed.
Practicalities of Applying NMF
Number of Components
In applying NMF to any dataset the number of components must be pre–specified.
Increasing the number of components will always increase the precision of the recon-
structions, but the increased precision is not always desirable. With too many compo-
nents it becomes beneficial for the NMF procedure, in terms of the total error in the
reconstructions, to overfit a small number of particularly noisy or very unusual spectra,
incorporating their noise, or their unique features, in the components. It is thus most ef-
fectiveto employ the maximumnumber of components that does not produce overfitting
in any subset of spectra. The optimal number of components must be chosen separately
for each sample. For reconstructions of the SDSS quasar spectra the number is between
8 and 15, with the appropriate number determined via a simple trial and error scheme.
Some care must be taken when selecting the sample to use as inputs to the NMF
algorithm. A larger sample will better constrain the resulting components, but this must
be balanced against the increased CPU time required for the calculations. A sample size
in a reasonable time on a modern desktop computer.
The range of properties of the sample spectra should reflect the range of properties of
the population from which they are drawn. Individual objects with very unusual proper-
ties can cause problems by inducing overfitting at a smaller number of components than
would otherwise be the case. In such cases the most extreme objects must be removed
from the sample.
Before a source separation can be performed the input spectra must be shifted to their
respective rest frames. Accurate redshifts are required to ensure that specific features
occur in the same location in all spectra. Redshift errors of ∼ 0.001 are sufficient to
blur narrow features when viewed across the sample, reducing the quality of the NMF
In the following work the redshifts used were recalculated from the SDSS spectra in
order to reduce the errors and biases present in the SDSS redshifts, which are each as
large as ±0.005 at redshifts z > 2. At low redshifts the new values were calculated from
the positions of the [O III] and [O II] emission lines, while at higher redshifts improved
cross-correlation measures were developed that are unbiased with respect to the narrow
emission line redshifts at low-redshift.
NMF reconstructs the observed spectra as a linear combination of components, as-
suming all contributions are additive. In contrast, dust reddening multiplies the spectra
by a wavelength-dependent flux ratio. For moderate levels of reddening this effect is
smaller than the natural object-to-object variations between spectra and is accounted for
in the NMF components, but this is not the case for the most heavily obscured objects.
To improve the quality of the reconstructions of heavily reddened objects an estimate
of the unreddened spectrum can be made. By multiplying the observed spectrum by a
power law slope it can be given a slope within the typical range for the object being
observed. The edited spectrum is then used as an input for the NMF algorithm, and the
results can be divided by the same power law slope to match the observed spectrum.
APPLICATION TO BROAD ABSORPTION LINE QUASARS
Broad Absorption Line Quasars
Broad absorption line quasars (BALQSOs) are quasars that exhibit strong broad
absorption line (BAL) systems, with velocity widths often in excess of 10 000 km s−1.
They are always intrinsic to the quasar, although they may be blueshifted with respect
to the active galactic nucleus (AGN) by over 20 000 km s−1. The blueshift is taken to
mean that the BAL systems are part of large-scale outflows driven by the AGN.
Methods of classifying BALQSOs vary, but the most widely used measure is the
balnicity index (BI) , defined as
BI = −
?3000 km s−1
25000 km s−1
where f(v) is the continuum-normalised flux as a function of velocity, v, relative to the
line centre. The constant C is equal to 1 in regions where f(v) has been continuously
less than 0.9 for at least 2000 km s−1, and 0 elsewhere.
Depending on the measure used, the observed BAL fraction is between 10 and 15 per
cent [4, 5, 6]; the fraction increases to 17 to 22 per cent when corrected for differential
selection effects between BAL and non-BAL quasars. The presence of BAL systems in
some, but not all, quasars could be the result of an orientation effect [4, 7], or BALQSOs
could represent a particular stage in the quasar life-cycle [8, 9].
Application of NMF
In orderto characterise BALQSOs, estimatesof theunabsorbed continuaare required,
which we have generated using the NMF technique. Samples of spectra, each consisting
of 500 non-BAL quasars, was selected to cover all redshifts, 0.7 < z < 2.6, in redshift
binsofwidth∆z=0.1. TheNMFalgorithmwas applied toeach sampletoproduce NMF
component spectra, applicable to quasars within a specified redshift interval.
spectra. During this fitting the components were held fixed and only the weights were
initially masked, and the mask was then iteratively updated by comparing the resulting
reconstructed continuum with the observed spectrum. The NMF fitting was recalculated
after each mask update; in most cases the mask locations reached a stable solution after
only two or three iterations.
The classification of quasars into BAL and non-BAL categories is sensitive to the
redshift used in the calculations, as an inaccurate redshift can move absorption features
into or out of the velocity range examined. An inaccurate redshift will also reduce the
quality of any estimate of the continuum. Unfortunately, inaccurate redshifts are more
likely for BALQSOs, as the absorption of the blue wing of the C IV line means that any
method using this line will overestimate the redshift. The redshift measurements used
here do not normally use the regions of the spectra that exhibit broad absorption, so they
are not expected to be significantly affected by this bias.
At redshifts z ≥ 2.6 there are insufficient non-BAL quasars with adequate signal-to-
noise ratio (SNR) available to produce useful components. However, due to the Lyα
forest, no additional quasar “continuum” enters the spectra at these redshifts so the
components from the 2.5 ≤ z < 2.6 redshift bin were used to fit quasars with z ≥ 2.6.
The NMF-procedure described above has been applied to generate estimated continua
for over 80 000 quasar spectra with redshifts 0.7 < z < 4.4. Example reconstructions of
BALQSO continua with a range of redshifts and BI values are shown in Figure 1.
In order to quantify the accuracy with which the continua are reconstructed, a set
of synthetic BALQSO spectra were created. First, continua were fitted to a set of 105
BALQSOs, redshifts 2.25 < z < 2.35, with high SNR spectra, using the procedure
outlined above. A flux-ratio spectrum for each BALQSO was generated by dividing
the observed spectrum by the reconstruction, then smoothing to remove noise. A set of
randomlychosen,non-BAL quasarspectrawere then multipliedby theflux ratiospectra,
generating synthetic BALQSO spectra for which the unabsorbed continua were known.
Continua were fitted to the synthetic BALQSO spectra by the same method as for
genuine BALQSOs, and the resulting BI measurements were compared to the real BI
values of the input flux ratios as a quantitative test of the NMF procedure. The results
are shown in Figure 2 (a). In most cases there is good agreement between the input and
measured values, but the BI is frequently undermeasured. The undermeasurement is due
largely to the effect of noise in the lower SNR spectra: a single pixel with positive noise
that brings its value above 90% of the continuum can prevent a large velocity interval
from contributing to the BI measurement.
Figure 2 (b) shows the BI values measured when the noise is reduced by smoothing
the measured flux ratios in the same way as for the input flux ratios. The smoothing
reduces significantly the number of systems in which the BI is undermeasured, resulting
in an RMS residual of 250 km s−1with a mean offset of only 20 km s−1.
The effect of noise in BI calculations has been noted previously  but to date it
has not been quantified. A forthcoming analysis of the undermeasurement of the BI at
different signal-to-noise ratios will allow an accurate determination of the BAL fraction
as a function of quasar luminosity and redshift.
Non-negative matrix factorisation shows much promise as a technique for analysis
of modern spectroscopic surveys such as the SDSS. We have demonstrated here its
application to the problem of estimating the continua of BALQSOs. Component spectra
were generated from sets of non-BAL quasars, and then used in conjunction with
iteratively defined masks to identify regions of absorption in other quasar spectra and
reconstruct their continua.