Alexander S. Szalay, PhD
Johns Hopkins University | JHU · Department of Physics and Astronomy
About
891 Publications
118,300 Reads
112,915 Citations
Additional affiliations
January 1987 - present
August 1984 - January 1986
January 1981 - June 1982
Education
September 1972 - June 1975
Publications (891)
We present a scheme based on artificial neural networks (ANNs) to estimate the line-of-sight velocities of individual galaxies from an observed redshift–space galaxy distribution. We find an estimate of the peculiar velocity at a galaxy based on galaxy counts and barycentres in shells around it. By training the network with environmental characteri...
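A minimal sketch of the shell-based feature idea described in this abstract: for each galaxy, count neighbours and compute barycentre offsets in concentric shells, then regress the line-of-sight velocity with a small neural network. The positions, velocities, shell radii, and network size below are invented placeholders, not the paper's data or architecture.

```python
# Hypothetical sketch of the shell-feature + neural-network idea; all inputs are placeholders.
import numpy as np
from scipy.spatial import cKDTree
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
positions = rng.uniform(0, 100.0, size=(5000, 3))   # redshift-space galaxy positions [Mpc/h]
v_los = rng.normal(0, 300.0, size=5000)             # "true" line-of-sight velocities [km/s]
shells = [0.0, 2.0, 5.0, 10.0, 20.0]                # shell edges around each galaxy [Mpc/h]

tree = cKDTree(positions)

def shell_features(i):
    """Neighbour counts and barycentre offsets in concentric shells around galaxy i."""
    feats = []
    for r_in, r_out in zip(shells[:-1], shells[1:]):
        idx = tree.query_ball_point(positions[i], r_out)
        idx = [j for j in idx
               if j != i and np.linalg.norm(positions[j] - positions[i]) >= r_in]
        count = len(idx)
        offset = (positions[idx].mean(axis=0) - positions[i]) if count else np.zeros(3)
        feats.extend([count, *offset])
    return feats

X = np.array([shell_features(i) for i in range(len(positions))])
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
model.fit(X[:4000], v_los[:4000])                    # train on one subset ...
print(model.score(X[4000:], v_los[4000:]))           # ... score the rest (meaningless for random data)
```

The scikit-learn regressor stands in for whatever ANN architecture the authors actually used; the point is only the mapping from per-galaxy shell features to a velocity estimate.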
Motivation:
Read alignment is an essential first step in the characterization of DNA sequence variation. The accuracy of variant calling results depends not only on the quality of read alignment and variant calling software but also on the interaction between these complex software tools.
Results:
In this review, we evaluate short-read aligner p...
Antipsychotic drugs are the current first-line of treatment for schizophrenia and other psychotic conditions. However, their molecular effects on the human brain are poorly studied, due to difficulty of tissue access and confounders associated with disease status. Here we examine differences in gene expression and DNA methylation associated with po...
Over the past decade, short-read sequence alignment has become a mature technology. Optimized algorithms, careful software engineering, and high-speed hardware have contributed to greatly increased throughput and accuracy. With these improvements, many opportunities for performance optimization have emerged.
In this review, we examine three general...
DNA methylation (DNAm) is an epigenetic regulator of gene expression and a hallmark of gene-environment interaction. Using whole-genome bisulfite sequencing, we have surveyed DNAm in 344 samples of human postmortem brain tissue from neurotypical subjects and individuals with schizophrenia. We identify genetic influence on local methylation levels t...
We discuss the history and lessons learned from a series of deployments of environmental sensors measuring soil parameters and CO2 fluxes over the last fifteen years, in an outdoor environment. We present the hardware and software architecture of our current Gen-3 system, and then discuss how we are simplifying the user facing part of the software,...
We propose a new scheme to reconstruct the baryon acoustic oscillations (BAO) signal, which contains key cosmological information, based on deep convolutional neural networks (CNN). Trained with almost no fine tuning, the network can recover large-scale modes accurately in the test set: the correlation coefficient between the true and reconstructed...
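As a rough illustration of the reconstruction setup (not the paper's network), a small 1D convolutional model can be trained to map a noisy "nonlinear" field to a target "linear" field; the fields, depth, and kernel sizes below are arbitrary stand-ins.

```python
# Hedged sketch: a tiny 1D CNN learning a field-to-field mapping on synthetic data.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
    nn.Conv1d(16, 16, kernel_size=9, padding=4), nn.ReLU(),
    nn.Conv1d(16, 1, kernel_size=9, padding=4),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

linear = torch.randn(32, 1, 256)                       # placeholder "true" large-scale field
nonlinear = linear + 0.3 * torch.randn_like(linear)    # placeholder "observed" field

for step in range(200):                                # toy training loop
    opt.zero_grad()
    loss = loss_fn(net(nonlinear), linear)
    loss.backward()
    opt.step()
print(float(loss))
```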
Running machine learning analytics over geographically distributed datasets is a rapidly emerging problem, driven by data management policies that ensure privacy and data security. Visualizing high-dimensional data using tools such as t-distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP) became...
In large DNA sequence repositories, archival data storage is often coupled with computers that provide 40 or more CPU threads and multiple GPU (general-purpose graphics processing unit) devices. This presents an opportunity for DNA sequence alignment software to exploit high-concurrency hardware to generate short-read alignments at high speed. Ario...
DNA methylation (DNAm) regulates gene expression and may represent gene-environment interactions. Using whole genome bisulfite sequencing, we surveyed DNAm in a large sample (n=344) of human brain tissues. We identify widespread genetic influence on local methylation levels throughout the genome, with 76% of SNPs and 38% of CpGs being part of methy...
We present SciServer, a science platform built and supported by the Institute for Data Intensive Engineering and Science at the Johns Hopkins University. SciServer builds upon and extends the SkyServer system of server-side tools that introduced the astronomical community to SQL (Structured Query Language) and has been serving the Sloan Digital Sky...
DNA methylation (DNAm) is a key epigenetic regulator of gene expression across development. The developing prenatal brain is a highly dynamic tissue, but our understanding of key drivers of epigenetic variability across development is limited. We therefore assessed genomic methylation at over 39 million sites in the prenatal cortex using whole geno...
The standard explanation for galaxy spin starts with the tidal-torque theory (TTT), in which an ellipsoidal dark-matter protohalo, which comes to host the galaxy, is torqued up by the tidal gravitational field around it. We discuss a complementary picture, using the relatively familiar velocity field inside the protohalo, instead of the tidal field...
Background: DNA methylation (DNAm) is a key epigenetic regulator of gene expression across development. The developing prenatal brain is a highly dynamic tissue, but our understanding of key drivers of epigenetic variability across development is limited.
Results: We therefore assessed genomic methylation at over 39 million sites in the prenatal c...
Cosmological N-body simulations are crucial for understanding how the Universe evolves. Studying large-scale distributions of matter in these simulations and comparing them to observations usually involves detecting dense clusters of particles called "halos," which are gravitationally bound and expected to form galaxies. However, traditional clust...
We present the multi-GPU realization of the StePS (Stereographically Projected Cosmological Simulations) algorithm with MPI–OpenMP–CUDA hybrid parallelization and nearly ideal scale-out to multiple compute nodes. Our new zoom-in cosmological direct N-body simulation method simulates the infinite universe with unprecedented dynamic range for a given...
A Kavli Foundation sponsored workshop on the theme "Petabytes to Science" was held 12th to 14th of February 2019 in Las Vegas. The aim of this workshop was to discuss important trends and technologies which may support astronomy. We also tackled how to better shape the workforce for the new trends and how we should approach educa...
We reexamine how angular momentum arises in dark-matter haloes. The standard tidal-torque theory (TTT), in which an ellipsoidal protohalo is torqued up by the tidal field, is an approximation to a mechanism which is more accurate, and that we find to be more pedagogically appealing. In the initial conditions, within a collapsing protohalo, there is...
The Wide Field Infrared Survey Telescope (WFIRST) is a 2.4m space telescope with a 0.281 deg^2 field of view for near-IR imaging and slitless spectroscopy and a coronagraph designed for > 10^8 starlight suppression. As background information for Astro2020 white papers, this article summarizes the current design and anticipated performance of WFIRST....
Scientists deploy environmental monitoring networks to discover previously unobservable phenomena and quantify subtle spatial and temporal differences in the physical quantities they measure. Our experience, shared by others, has shown that measurements gathered by such networks are perturbed by sensor faults. In response, multiple fault detection...
Harsh deployment environments and uncertain run-time conditions create numerous challenges for postmortem time reconstruction methods. For example, motes often reboot and thus lose their clock state, considering that the majority of mote platforms lack a real-time clock. While existing time reconstruction methods for long-term data gathering networ...
This paper investigates postmortem timestamp reconstruction in environmental monitoring networks. In the absence of a time-synchronization protocol, these networks use multiple pairs of (local, global) timestamps to retroactively estimate the motes' clock drift and offset and thus reconstruct the measurement time series. We present Sundial, a novel...
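The core clock model behind this kind of postmortem reconstruction is linear: global_time ≈ offset + drift × local_time, fitted from recorded (local, global) anchor pairs. A toy least-squares version, with made-up anchor values and none of Sundial's robustness machinery, looks like this:

```python
# Minimal illustration of the linear clock model; anchor pairs are synthetic placeholders.
import numpy as np

local   = np.array([10.0, 250.0, 980.0, 2400.0, 5100.0])       # mote clock readings [s]
global_ = np.array([1010.0, 1250.1, 1980.3, 3400.6, 6101.1])   # reference timestamps [s]

drift, offset = np.polyfit(local, global_, deg=1)               # least-squares line fit
print(f"drift = {drift:.6f}, offset = {offset:.2f} s")

# Reconstruct global timestamps for every raw sample recorded by the mote.
samples_local = np.linspace(0.0, 5000.0, 6)
samples_global = offset + drift * samples_local
print(samples_global)
```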
We present the multi-GPU realization of the StePS (Stereographically Projected Cosmological Simulations) algorithm with MPI-OpenMP-CUDA hybrid parallelization, and show what parallelization efficiency can be reached. We use a new zoom-in cosmological direct N-body simulation method, that can simulate the infinite universe with unprecedented dynamic...
Working with Jim Gray, we set out more than 20 years ago to design and build the archive for the Sloan Digital Sky Survey (SDSS), the SkyServer. The SDSS project collected a huge data set over a large fraction of the Northern Sky and turned it into an open resource for the world’s astronomy community. Over the years the project has changed astronom...
Motivation:
DNA sequencing archives have grown to enormous scales in recent years, and thousands of human genomes have already been sequenced. The size of these data sets has made searching the raw read data infeasible without high-performance data-query technology. Additionally, it is challenging to search a repository of short-read data using re...
Over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by the ongoing revolution in high-end data analysis (HDA) might be integrated with the established, simulation-centric paradigm of...
Direct numerical simulation (DNS) of a transitional boundary layer. Data are stored on 3320 × 224 × 2048 grid points. The full time evolution is available, over about 1 flow-through time across the length of simulation domain.
Data science promises new insights, helping transform information into knowledge that can drive science and industry.
Motivation
The alignment of bisulfite-treated DNA sequences (BS-seq reads) to a large genome involves a significant computational burden beyond that required to align non-bisulfite-treated reads. In the analysis of BS-seq data, this can present an important performance bottleneck that can be mitigated by appropriate algorithmic and software-enginee...
Twenty years ago, work commenced on the Sloan Digital Sky Survey. The project aimed to collect a statistically complete dataset over a large fraction of the sky and turn it into an open data resource for the world’s astronomy community. There were few examples to learn from, and those of us who worked on it had to invent much of the system ourselve...
Although the “big data” revolution first came to public prominence (circa 2010) in online enterprises like Google, Amazon, and Facebook, it is now widely recognized as the initial phase of a watershed transformation that modern society generally—and scientific and engineering research in particular—are in the process of undergoing. Responding to th...
Observations of large scale structure indicate the presence of very large, wall-like superclusters. In their distribution there is growing evidence of a characteristic scale in excess of 100 h^-1 Mpc. Other observations seem to imply only Gaussian fluctuations in the power spectrum. We describe how this apparent conflict can be caused...
Cosmological $N$-body simulations play a vital role in studying how the Universe evolves. To compare to observations and make scientific inference, statistical analysis on large simulation datasets, e.g., finding halos, obtaining multi-point correlation functions, is crucial. However, traditional in-memory methods for these tasks do not scale to the...
Cosmological $N$-body simulations play a vital role in studying models for the evolution of the Universe. To compare to observations and make a scientific inference, statistical analysis on large simulation datasets, e.g., finding halos, obtaining multi-point correlation functions, is crucial. However, traditional in-memory methods for these tasks do...
The alignment of bisulfite-treated DNA sequences (BS-seq reads) to a large genome involves a significant computational burden beyond that required to align non-bisulfite-treated reads. In the analysis of BS-seq data, this can present an important performance bottleneck that can potentially be addressed by appropriate software-engineering and algori...
Motivated by interest in the geometry of high intensity events of turbulent flows, we examine spatial correlation functions of sets where turbulent events are particularly intense. These sets are defined using indicator functions on excursion and iso-value sets. Their geometric scaling properties are analyzed by examining possible power-law decay o...
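A hedged sketch of the indicator-function measurement described above: threshold a field to obtain an excursion set, then compute the two-point correlation of the indicator via an FFT autocorrelation and a radial average. The smoothed Gaussian random field stands in for an actual turbulence snapshot.

```python
# Two-point correlation of an excursion-set indicator field, on synthetic data.
import numpy as np

n = 128
rng = np.random.default_rng(1)
white = rng.normal(size=(n, n, n))

# Low-pass filter the white noise so the excursion sets have some spatial structure.
kx = np.fft.fftfreq(n)[:, None, None]
ky = np.fft.fftfreq(n)[None, :, None]
kz = np.fft.fftfreq(n)[None, None, :]
kernel = np.exp(-0.5 * (kx**2 + ky**2 + kz**2) * 20.0**2)
field = np.real(np.fft.ifftn(np.fft.fftn(white) * kernel))

indicator = (field > field.mean() + 2 * field.std()).astype(float)   # excursion set
delta = indicator - indicator.mean()

# Autocorrelation of the indicator fluctuation via the Wiener-Khinchin theorem.
power = np.abs(np.fft.fftn(delta))**2
corr = np.real(np.fft.ifftn(power)) / delta.size

# Radial average over lag vectors to obtain C(r).
lag = np.fft.fftfreq(n) * n
r = np.sqrt(lag[:, None, None]**2 + lag[None, :, None]**2 + lag[None, None, :]**2)
C_r = [corr[(r >= b) & (r < b + 1)].mean() for b in range(n // 2)]
print(np.round(C_r[:8], 5))
```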
Numerical simulations present challenges because they generate petabyte-scale data that must be extracted and reduced during the simulation. We demonstrate a seamless integration of feature extraction for a simulation of turbulent fluid dynamics. The simulation produces on the order of 6 TB per timestep. In order to analyze and store this data, we...
Time synchronization is an essential service in many sensor network applications. Several time synchronization protocols have been proposed to synchronize the clocks on sensor nodes. All the synchronization protocols bring energy and complexity overhead to the network. There are many WSN applications which timestamp their samples with second level...
Pan-STARRS1 has carried out a distinct set of imaging synoptic sky surveys including the 3π Steradian Survey and Medium Deep Survey in 5 bands (grizy). This paper describes the database and data products from the Pan-STARRS1 Surveys. The data products include images and an SQL-based relational database available from the STScI MAST. The databas...
This paper describes the organization of the database and the catalog data products from the Pan-STARRS1 3π Steradian Survey. The catalog data products are available in the form of an SQL-based relational database from MAST, the Mikulski Archive for Space Telescopes at STScI. The database is described in detail, including the construction of th...
Observational astronomy in the time-domain era faces several new challenges. One of them is the efficient use of observations obtained at multiple epochs. The work presented here addresses faint object detection with multi-epoch data, and describes an incremental strategy for separating real objects from artifacts in ongoing surveys, in situations...
We present a flexible template-based photometric redshift estimation framework, implemented in C#, that can be seamlessly integrated into a SQL database (or DB) server and executed on-demand in SQL. The DB integration eliminates the need to move large photometric datasets outside a database for redshift estimation, and utilizes the computational ca...
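The underlying template fit can be illustrated outside the C#/SQL framework with a plain chi-square minimisation over a redshift grid; the template fluxes, observed fluxes, and errors below are synthetic placeholders rather than SDSS data.

```python
# Illustrative chi-square template fit for a photometric redshift, on placeholder fluxes.
import numpy as np

z_grid = np.linspace(0.0, 1.0, 201)                  # trial redshifts
n_bands, n_templates = 5, 3

rng = np.random.default_rng(2)
# template_flux[t, z, b]: model flux of template t at redshift z in band b
template_flux = np.abs(rng.normal(1.0, 0.3, size=(n_templates, z_grid.size, n_bands)))

obs_flux = template_flux[1, 120] * 2.5 + rng.normal(0, 0.05, n_bands)   # fake galaxy
obs_err = np.full(n_bands, 0.05)

# For each (template, z), solve for the best-fit amplitude analytically, then chi^2.
w = 1.0 / obs_err**2
amp = (template_flux * obs_flux * w).sum(-1) / ((template_flux**2) * w).sum(-1)
chi2 = (((obs_flux - amp[..., None] * template_flux) ** 2) * w).sum(-1)

t_best, z_best = np.unravel_index(np.argmin(chi2), chi2.shape)
print(f"best template = {t_best}, photo-z = {z_grid[z_best]:.3f}")
```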
Accurate measures of galactic overdensities are invaluable for precision cosmology. Obtaining these measurements is complicated when members of one's galaxy sample lack radial depths, most commonly derived via spectroscopic redshifts. In this paper, we utilize the Sloan Digital Sky Survey's Main Galaxy Sample to compare seven methods of counting ga...
The most common statistic used to analyze large-scale structure surveys is the correlation function, or power spectrum. Here, we show how `slicing' the correlation function on local density brings sensitivity to interesting non-Gaussian features in the large-scale structure, such as the expansion or contraction of baryon acoustic oscillations (BAO)...
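A rough stand-alone sketch of density slicing (not the paper's estimator): assign each galaxy a local-density proxy, keep one density slice, and estimate its correlation function with a simple DD/RR − 1 estimator against a random catalogue. The catalogue, slice definition, and binning below are placeholders.

```python
# Density-sliced correlation function on a synthetic periodic point catalogue.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
box = 100.0
gals = rng.uniform(0, box, size=(20000, 3))

# Local density proxy: number of neighbours within 5 Mpc/h (periodic box).
tree = cKDTree(gals, boxsize=box)
density = np.array(tree.query_ball_point(gals, r=5.0, return_length=True))

slice_pts = gals[density > np.percentile(density, 75)]       # densest quartile
randoms = rng.uniform(0, box, size=(len(slice_pts) * 4, 3))

bins = np.linspace(1.0, 20.0, 11)

def pair_counts(pts, edges):
    t = cKDTree(pts, boxsize=box)
    cum = np.array([t.count_neighbors(t, r) for r in edges], dtype=float)
    return np.diff(cum)

dd = pair_counts(slice_pts, bins) / (len(slice_pts) ** 2)
rr = pair_counts(randoms, bins) / (len(randoms) ** 2)
xi = dd / rr - 1.0
print(np.round(xi, 3))
```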
In view of future high precision large scale structure surveys, it is important to quantify percent and sub-percent level effects in cosmological $N$-body simulations from which theoretical predictions are drawn. One such effect involves the choice of whether to set all modes above the one-dimensional Nyquist frequency, the so-called "corner" modes...
In view of future high-precision large-scale structure surveys, it is important to quantify the percent and subpercent level effects in cosmological $N$-body simulations from which theoretical predictions are drawn. One such effect involves deciding whether to zero all modes above the one-dimensional Nyquist frequency, the so-called "corner" modes,...
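The "corner" modes in question are the Fourier modes whose wavenumber magnitude exceeds the one-dimensional Nyquist frequency of the grid. A minimal numpy illustration of zeroing them, using a white-noise stand-in for actual initial conditions:

```python
# Zero the "corner" modes of a 3D Fourier grid, i.e. |k| above the 1D Nyquist frequency.
import numpy as np

n, box = 64, 100.0
rng = np.random.default_rng(4)
field = rng.normal(size=(n, n, n))

k1d = 2 * np.pi * np.fft.fftfreq(n, d=box / n)     # wavenumbers along one axis
k_nyq = np.pi * n / box                             # one-dimensional Nyquist frequency
kmag = np.sqrt(k1d[:, None, None]**2 + k1d[None, :, None]**2 + k1d[None, None, :]**2)

fk = np.fft.fftn(field)
fk[kmag > k_nyq] = 0.0                              # discard the corner modes
field_no_corners = np.real(np.fft.ifftn(fk))

removed = np.mean(kmag > k_nyq)
print(f"fraction of modes removed: {removed:.3f}")  # roughly 1 - pi/6 for a spherical cut
```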
Extremely large (peta-scale) data collections are generally partitioned into millions of containers (disks/volumes/files) which are essentially unmovable due to their aggregate size. They are stored over a large distributed cloud of machines, with computing co-located with the data. Given this data layout, even simple tasks are difficult to perform...
As numerical ocean simulations become more realistic, analysis of their output is increasingly time consuming. As part of a larger effort to make ocean model output more accessible for analysis, we have developed a fast particle-tracking algorithm for exploring the kinematics of the simulated flow. The algorithm is independent of operating system a...
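A toy particle-advection sketch, not the paper's algorithm: trace particles through a gridded velocity field with fourth-order Runge-Kutta steps and linear interpolation. The analytic cellular flow and step size are placeholders for real ocean-model output.

```python
# Particle tracking through a gridded 2D velocity field with RK4 and linear interpolation.
import numpy as np
from scipy.interpolate import RegularGridInterpolator

x = np.linspace(0, 2 * np.pi, 64)
y = np.linspace(0, 2 * np.pi, 64)
X, Y = np.meshgrid(x, y, indexing="ij")
u = np.cos(X) * np.sin(Y)           # simple analytic cellular flow
v = -np.sin(X) * np.cos(Y)

interp_u = RegularGridInterpolator((x, y), u, bounds_error=False, fill_value=0.0)
interp_v = RegularGridInterpolator((x, y), v, bounds_error=False, fill_value=0.0)

def velocity(p):
    return np.column_stack([interp_u(p), interp_v(p)])

def rk4_step(p, dt):
    k1 = velocity(p)
    k2 = velocity(p + 0.5 * dt * k1)
    k3 = velocity(p + 0.5 * dt * k2)
    k4 = velocity(p + dt * k3)
    return p + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

particles = np.random.default_rng(5).uniform(1.0, 5.0, size=(100, 2))
for _ in range(200):                 # integrate trajectories for 200 steps
    particles = rk4_step(particles, dt=0.05)
print(particles[:3])
```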
The Wide Field Infrared Survey Telescope (WFIRST) is a 2.4 m telescope with a large field of view (~0.3 deg²) and fine angular resolution (0.11″). WFIRST’s Wide Field Instrument (WFI) will obtain images in the Z, Y, J, H, F184, W149 (wide) filter bands, and grism spectra of the same large field of view. The data volume of the WFIRST Science Arc...
We present estimates of cosmological parameters from the application of the Karhunen-Loève transform to the analysis of the 3D power spectrum of density fluctuations using Sloan Digital Sky Survey galaxy redshifts. We use Ω_m h and f_b = Ω_b/Ω_m to describe the shape of the power spectrum, σ^L_{8g} for the (linearly extrapolated) normalization, and...
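A generic illustration of Karhunen-Loève compression (diagonalise a covariance matrix and keep the leading eigenmodes), rather than the survey analysis itself; the mock data vectors below are placeholders for binned galaxy counts.

```python
# Generic Karhunen-Loeve compression sketch on mock data vectors.
import numpy as np

rng = np.random.default_rng(6)
n_cells, n_mock = 200, 1000
signal = rng.normal(size=(n_mock, n_cells)) @ rng.normal(size=(n_cells, n_cells)) * 0.05
data = signal + rng.normal(size=(n_mock, n_cells))       # correlated signal + unit noise

cov = np.cov(data, rowvar=False)                          # sample covariance of the data
eigval, eigvec = np.linalg.eigh(cov)                      # KL modes of the data
order = np.argsort(eigval)[::-1]                          # sort by decreasing variance

n_keep = 20
B = eigvec[:, order[:n_keep]]                             # top-variance KL basis
compressed = data @ B                                     # KL coefficients per mock
print(compressed.shape)                                   # (1000, 20)
```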
We present the methodology and data behind the photometric redshift database of the Sloan Digital Sky Survey Data Release 12 (SDSS DR12). We adopt a hybrid technique, empirically estimating the redshift via local regression on a spectroscopic training set, then fitting a spectrum template to obtain K-corrections and absolute magnitudes. The SDSS sp...
We identify and correct numerous errors within the photometric and spectroscopic footprints (SFs) of the Sloan Digital Sky Survey's (SDSS) 6th data release (DR6). Within the SDSS's boundaries hundreds of millions of objects have been detected. Yet we present evidence that the boundaries themselves contain a significant number of mistakes that are b...
We study the parity-odd part (that we shall call Doppler term) of the linear galaxy two-point correlation function that arises from wide-angle, velocity, Doppler lensing and cosmic acceleration effects. As it is important at low redshift and at large angular separations, the Doppler term is usually neglected in the current generation of galaxy surv...
Many eigensolvers such as ARPACK and Anasazi have been developed to compute eigenvalues of a large sparse matrix. These eigensolvers are limited by the capacity of RAM. They run in memory of a single machine for smaller eigenvalue problems and require the distributed memory for larger problems. In contrast, we develop an SSD-based eigensolver frame...
The output from a direct numerical simulation (DNS) of turbulent channel flow at Reτ ≈ 1000 is used to construct a publicly accessible, Web-services-based spatio-temporal database for this flow. The simulated channel has a size of 8πh × 2h × 3πh, where h is the channel half-height. Data are stored at 2048 × 512 × 1536 spatial grid points for a total...
We simulate the tidal disruption of a collisionless N-body globular star cluster in a total of 300 different orbits selected to have galactocentric radii between 10 and 30 kpc in four dark matter halos: (a) a spherical halo with no subhalos, (b) a spherical halo with subhalos, (c) a realistic halo with no subhalos, and (d) a realistic halo with sub...
The Lagrangian dynamics of a single fluid element within a self-gravitational matter field is intrinsically non-local due to the presence of the tidal force. This complicates the theoretical investigation of the non-linear evolution of various cosmic objects, e.g. dark matter halos, in the context of Lagrangian fluid dynamics, since a fluid parcel...
We analyse the correlations between continuum properties and emission line equivalent widths of star-forming and active galaxies from the Sloan Digital Sky Survey. Since upcoming large sky surveys will make broad-band observations only, including strong emission lines into theoretical modelling of spectra will be essential to estimate physical prop...
We show that redshift-space distortions of galaxy correlations have a strong effect on correlation functions with the signature of the Baryon Acoustic Oscillations (BAO). Near the line of sight, the features become sharper as a result of redshift-space distortions. We analyze the SDSS DR7 main-galaxy sample (MGS), splitting the sample into slices 2...
Cosmological N-body simulations are essential for studies of the large-scale distribution of matter and galaxies in the Universe. This analysis often involves finding clusters of particles and retrieving their properties. Detecting such "halos" among a very large set of particles is a computationally intensive problem, usually executed on the same...
Accurate modelling of non-linearities in the galaxy bispectrum, the Fourier transform of the galaxy three-point correlation function, is essential to fully exploit it as a cosmological probe. In this paper, we present numerical and theoretical challenges in modelling the non-linear bispectrum. First, we test the robustness of the matter bispectrum...
Solid state disks (SSDs) have advanced to outperform traditional hard drives significantly in both random reads and writes. However, heavy random writes trigger frequent garbage collection and decrease the performance of SSDs. In an SSD array, garbage collection of individual SSDs is not synchronized, leading to underutilization of some of the...
When computing alignments of DNA sequences to a large genome, a key element in achieving high processing throughput is to prioritize locations in the genome where high-scoring mappings might be expected. We formulated this task as a series of list-processing operations that can be efficiently performed on graphics processing unit (GPU) hardware. We...
Graph analysis performs many random reads and writes, thus, these workloads are typically performed in memory. Traditionally, analyzing large graphs requires a cluster of machines so the aggregate memory exceeds the graph size. We demonstrate that a multicore server can process graphs with billions of vertices and hundreds of billions of edges, uti...
High Performance Computing is becoming an instrument in its own right. The largest simulations performed on our supercomputers are now approaching petabytes. As the volume of these simulations is growing, it is becoming harder to access, analyze and visualize these data. At the same time, for broad community buy-in we need to provide public access...
Computerized decision making is becoming a reality with exponentially growing data and machine capabilities. Some decision making is extremely complex, historically reserved for governing bodies or market places where the collective human experience and intelligence come to play. Other decision making can be trusted to computers that are on a path...