Alexander S. Szalay

Johns Hopkins University | JHU · Department of Physics and Astronomy

PhD

About

891
Publications
118,300
Reads
112,915
Citations
Additional affiliations
January 1987 - present
Johns Hopkins University
August 1984 - January 1986
University of Chicago
January 1981 - June 1982
Education
September 1972 - June 1975
Eötvös Loránd University
Field of study
  • Astrophysics

Publications (891)
Article
We present a scheme based on artificial neural networks (ANNs) to estimate the line-of-sight velocities of individual galaxies from an observed redshift–space galaxy distribution. We find an estimate of the peculiar velocity at a galaxy based on galaxy counts and barycentres in shells around it. By training the network with environmental characteri...
Article
Full-text available
Motivation: Read alignment is an essential first step in the characterization of DNA sequence variation. The accuracy of variant calling results depends not only on the quality of read alignment and variant calling software but also on the interaction between these complex software tools. Results: In this review, we evaluate short-read aligner p...
Article
Full-text available
Antipsychotic drugs are the current first line of treatment for schizophrenia and other psychotic conditions. However, their molecular effects on the human brain are poorly studied, due to the difficulty of tissue access and confounders associated with disease status. Here we examine differences in gene expression and DNA methylation associated with po...
Article
Over the past decade, short-read sequence alignment has become a mature technology. Optimized algorithms, careful software engineering, and high-speed hardware have contributed to greatly increased throughput and accuracy. With these improvements, many opportunities for performance optimization have emerged. In this review, we examine three general...
Preprint
Full-text available
Antipsychotic drugs are the current first line of treatment for schizophrenia and other psychotic conditions. However, their molecular effects on the human brain are poorly studied, due to the difficulty of tissue access and confounders associated with disease status. Here we examine differences in gene expression and DNA methylation associated with po...
Article
Full-text available
DNA methylation (DNAm) is an epigenetic regulator of gene expression and a hallmark of gene-environment interaction. Using whole-genome bisulfite sequencing, we have surveyed DNAm in 344 samples of human postmortem brain tissue from neurotypical subjects and individuals with schizophrenia. We identify genetic influence on local methylation levels t...
Preprint
Full-text available
We discuss the history and lessons learned from a series of deployments of environmental sensors measuring soil parameters and CO2 fluxes over the last fifteen years, in an outdoor environment. We present the hardware and software architecture of our current Gen-3 system, and then discuss how we are simplifying the user facing part of the software,...
Article
We propose a new scheme to reconstruct the baryon acoustic oscillations (BAO) signal, which contains key cosmological information, based on deep convolutional neural networks (CNN). Trained with almost no fine tuning, the network can recover large-scale modes accurately in the test set: the correlation coefficient between the true and reconstructed...
Preprint
Running machine learning analytics over geographically distributed datasets is a rapidly arising problem in the world of data management policies ensuring privacy and data security. Visualizing high dimensional data using tools such as t-distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP) became...
Article
Full-text available
In large DNA sequence repositories, archival data storage is often coupled with computers that provide 40 or more CPU threads and multiple GPU (general-purpose graphics processing unit) devices. This presents an opportunity for DNA sequence alignment software to exploit high-concurrency hardware to generate short-read alignments at high speed. Ario...
Preprint
Full-text available
DNA methylation (DNAm) regulates gene expression and may represent gene-environment interactions. Using whole genome bisulfite sequencing, we surveyed DNAm in a large sample (n=344) of human brain tissues. We identify widespread genetic influence on local methylation levels throughout the genome, with 76% of SNPs and 38% of CpGs being part of methy...
Article
Full-text available
We present SciServer, a science platform built and supported by the Institute for Data Intensive Engineering and Science at the Johns Hopkins University. SciServer builds upon and extends the SkyServer system of server-side tools that introduced the astronomical community to SQL (Structured Query Language) and has been serving the Sloan Digital Sky...
Article
DNA methylation (DNAm) is a key epigenetic regulator of gene expression across development. The developing prenatal brain is a highly dynamic tissue, but our understanding of key drivers of epigenetic variability across development is limited. We therefore assessed genomic methylation at over 39 million sites in the prenatal cortex using whole geno...
Article
Full-text available
The standard explanation for galaxy spin starts with the tidal-torque theory (TTT), in which an ellipsoidal dark-matter protohalo, which comes to host the galaxy, is torqued up by the tidal gravitational field around it. We discuss a complementary picture, using the relatively familiar velocity field inside the protohalo, instead of the tidal field...
Preprint
We present SciServer, a science platform built and supported by the Institute for Data Intensive Engineering and Science at the Johns Hopkins University. SciServer builds upon and extends the SkyServer system of server-side tools that introduced the astronomical community to SQL (Structured Query Language) and has been serving the Sloan Digital Sky...
Preprint
Full-text available
Background: DNA methylation (DNAm) is a key epigenetic regulator of gene expression across development. The developing prenatal brain is a highly dynamic tissue, but our understanding of key drivers of epigenetic variability across development is limited. Results: We therefore assessed genomic methylation at over 39 million sites in the prenatal c...
Preprint
Cosmological N-body simulations are crucial for understanding how the Universe evolves. Studying large-scale distributions of matter in these simulations and comparing them to observations usually involves detecting dense clusters of particles called "halos," which are gravitationally bound and expected to form galaxies. However, traditional clust...
Preprint
Full-text available
DNA methylation (DNAm) is a key epigenetic regulator of gene expression across development. The developing prenatal brain is a highly dynamic tissue, but our understanding of key drivers of epigenetic variability across development is limited. We therefore assessed genomic methylation at over 39 million sites in the prenatal cortex using whole geno...
Article
We present the multi-GPU realization of the StePS (Stereographically Projected Cosmological Simulations) algorithm with MPI–OpenMP–CUDA hybrid parallelization and nearly ideal scale-out to multiple compute nodes. Our new zoom-in cosmological direct N-body simulation method simulates the infinite universe with unprecedented dynamic range for a given...
Preprint
Full-text available
A Kavli Foundation-sponsored workshop on the theme Petabytes to Science was held from the 12th to the 14th of February 2019 in Las Vegas. The aim of this workshop was to discuss important trends and technologies which may support astronomy. We also tackled how to better shape the workforce for the new trends and how we should approach educa...
Preprint
Full-text available
We reexamine how angular momentum arises in dark-matter haloes. The standard tidal-torque theory (TTT), in which an ellipsoidal protohalo is torqued up by the tidal field, is an approximation to a mechanism which is more accurate, and that we find to be more pedagogically appealing. In the initial conditions, within a collapsing protohalo, there is...
Preprint
Full-text available
The Wide Field Infrared Survey Telescope (WFIRST) is a 2.4 m space telescope with a 0.281 deg^2 field of view for near-IR imaging and slitless spectroscopy and a coronagraph designed for > 10^8 starlight suppression. As background information for Astro2020 white papers, this article summarizes the current design and anticipated performance of WFIRST....
Preprint
Scientists deploy environmental monitoring networks to discover previously unobservable phenomena and quantify subtle spatial and temporal differences in the physical quantities they measure. Our experience, shared by others, has shown that measurements gathered by such networks are perturbed by sensor faults. In response, multiple fault detection...
Preprint
Harsh deployment environments and uncertain run-time conditions create numerous challenges for postmortem time reconstruction methods. For example, motes often reboot and thus lose their clock state, considering that the majority of mote platforms lack a real-time clock. While existing time reconstruction methods for long-term data gathering networ...
Preprint
This paper investigates postmortem timestamp reconstruction in environmental monitoring networks. In the absence of a time-synchronization protocol, these networks use multiple pairs of (local, global) timestamps to retroactively estimate the motes' clock drift and offset and thus reconstruct the measurement time series. We present Sundial, a novel...
Preprint
We present the multi-GPU realization of the StePS (Stereographically Projected Cosmological Simulations) algorithm with MPI-OpenMP-CUDA hybrid parallelization, and show what parallelization efficiency can be reached. We use a new zoom-in cosmological direct N-body simulation method, that can simulate the infinite universe with unprecedented dynamic...
Chapter
Working with Jim Gray, we set out more than 20 years ago to design and build the archive for the Sloan Digital Sky Survey (SDSS), the SkyServer. The SDSS project collected a huge data set over a large fraction of the Northern Sky and turned it into an open resource for the world’s astronomy community. Over the years the project has changed astronom...
Article
Motivation: DNA sequencing archives have grown to enormous scales in recent years, and thousands of human genomes have already been sequenced. The size of these data sets has made searching the raw read data infeasible without high-performance data-query technology. Additionally, it is challenging to search a repository of short-read data using re...
Article
Full-text available
Over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by the ongoing revolution in high-end data analysis (HDA) might be integrated with the established, simulation-centric paradigm of...
Research
Full-text available
Direct numerical simulation (DNS) of a transitional boundary layer. Data are stored on 3320 × 224 × 2048 grid points. The full time evolution is available, over about 1 flow-through time across the length of simulation domain.
Article
Data science promises new insights, helping transform information into knowledge that can drive science and industry.
Article
Motivation: The alignment of bisulfite-treated DNA sequences (BS-seq reads) to a large genome involves a significant computational burden beyond that required to align non-bisulfite-treated reads. In the analysis of BS-seq data, this can present an important performance bottleneck that can be mitigated by appropriate algorithmic and software-enginee...
Article
Twenty years ago, work commenced on the Sloan Digital Sky Survey. The project aimed to collect a statistically complete dataset over a large fraction of the sky and turn it into an open data resource for the world’s astronomy community. There were few examples to learn from, and those of us who worked on it had to invent much of the system ourselve...
Preprint
Full-text available
Although the “big data” revolution first came to public prominence (circa 2010) in online enterprises like Google, Amazon, and Facebook, it is now widely recognized as the initial phase of a watershed transformation that modern society generally—and scientific and engineering research in particular—are in the process of undergoing. Responding to th...
Conference Paper
Full-text available
Observations of large-scale structure indicate the presence of very large, wall-like superclusters. In their distributions there is growing evidence that there is a characteristic scale in excess of 100 h^-1 Mpc. Other observations seem to imply only Gaussian fluctuations in the power spectrum. We describe how this apparent conflict can be caused...
Article
Cosmological N-body simulations play a vital role in studying how the Universe evolves. To compare with observations and make scientific inferences, statistical analysis of large simulation datasets (e.g., finding halos or obtaining multi-point correlation functions) is crucial. However, traditional in-memory methods for these tasks do not scale to the...
Preprint
Cosmological N-body simulations play a vital role in studying models for the evolution of the Universe. To compare with observations and make scientific inferences, statistical analysis of large simulation datasets (e.g., finding halos or obtaining multi-point correlation functions) is crucial. However, traditional in-memory methods for these tasks do...
Preprint
Full-text available
The alignment of bisulfite-treated DNA sequences (BS-seq reads) to a large genome involves a significant computational burden beyond that required to align non-bisulfite-treated reads. In the analysis of BS-seq data, this can present an important performance bottleneck that can potentially be addressed by appropriate software-engineering and algori...
Article
Full-text available
Motivated by interest in the geometry of high intensity events of turbulent flows, we examine spatial correlation functions of sets where turbulent events are particularly intense. These sets are defined using indicator functions on excursion and iso-value sets. Their geometric scaling properties are analyzed by examining possible power-law decay o...
Preprint
Motivated by interest in the geometry of high intensity events of turbulent flows, we examine spatial correlation functions of sets where turbulent events are particularly intense. These sets are defined using indicator functions on excursion and iso-value sets. Their geometric scaling properties are analyzed by examining possible power-law decay o...
Conference Paper
Full-text available
Numerical simulations present challenges because they generate petabyte-scale data that must be extracted and reduced during the simulation. We demonstrate a seamless integration of feature extraction for a simulation of turbulent fluid dynamics. The simulation produces on the order of 6 TB per timestep. In order to analyze and store this data, we...
Conference Paper
Time synchronization is an essential service in many sensor network applications. Several time synchronization protocols have been proposed to synchronize the clocks on sensor nodes. All the synchronization protocols bring energy and complexity overhead to the network. There are many WSN applications which timestamp their samples with second level...
Article
Full-text available
Pan-STARRS1 has carried out a distinct set of imaging synoptic sky surveys including the 3π Steradian Survey and Medium Deep Survey in 5 bands (grizy). This paper describes the database and data products from the Pan-STARRS1 Surveys. The data products include images and an SQL-based relational database available from the STScI MAST. The databas...
Preprint
This paper describes the organization of the database and the catalog data products from the Pan-STARRS1 3π Steradian Survey. The catalog data products are available in the form of an SQL-based relational database from MAST, the Mikulski Archive for Space Telescopes at STScI. The database is described in detail, including the construction of th...
Article
Observational astronomy in the time-domain era faces several new challenges. One of them is the efficient use of observations obtained at multiple epochs. The work presented here addresses faint object detection with multi-epoch data, and describes an incremental strategy for separating real objects from artifacts in ongoing surveys, in situations...
Preprint
Observational astronomy in the time-domain era faces several new challenges. One of them is the efficient use of observations obtained at multiple epochs. The work presented here addresses faint object detection with multi-epoch data, and describes an incremental strategy for separating real objects from artifacts in ongoing surveys, in situations...
Article
Full-text available
We present a flexible template-based photometric redshift estimation framework, implemented in C#, that can be seamlessly integrated into a SQL database (or DB) server and executed on-demand in SQL. The DB integration eliminates the need to move large photometric datasets outside a database for redshift estimation, and utilizes the computational ca...
Preprint
We present a flexible template-based photometric redshift estimation framework, implemented in C#, that can be seamlessly integrated into a SQL database (or DB) server and executed on-demand in SQL. The DB integration eliminates the need to move large photometric datasets outside a database for redshift estimation, and utilizes the computational ca...
Article
Accurate measures of galactic overdensities are invaluable for precision cosmology. Obtaining these measurements is complicated when members of one's galaxy sample lack radial depths, most commonly derived via spectroscopic redshifts. In this paper, we utilize the Sloan Digital Sky Survey's Main Galaxy Sample to compare seven methods of counting ga...
Article
The most common statistic used to analyze large-scale structure surveys is the correlation function, or power spectrum. Here, we show how 'slicing' the correlation function on local density brings sensitivity to interesting non-Gaussian features in the large-scale structure, such as the expansion or contraction of baryon acoustic oscillations (BAO)...
Preprint
The most common statistic used to analyze large-scale structure surveys is the correlation function, or power spectrum. Here, we show how 'slicing' the correlation function on local density brings sensitivity to interesting non-Gaussian features in the large-scale structure, such as the expansion or contraction of baryon acoustic oscillations (BAO)...
Article
In view of future high-precision large-scale structure surveys, it is important to quantify percent and sub-percent level effects in cosmological N-body simulations from which theoretical predictions are drawn. One such effect involves the choice of whether to set all modes above the one-dimensional Nyquist frequency, the so-called "corner" modes...
Preprint
In view of future high-precision large-scale structure surveys, it is important to quantify the percent and subpercent level effects in cosmological N-body simulations from which theoretical predictions are drawn. One such effect involves deciding whether to zero all modes above the one-dimensional Nyquist frequency, the so-called "corner" modes,...
Conference Paper
Extremely large (peta-scale) data collections are generally partitioned into millions of containers (disks/volumes/files) which are essentially unmovable due to their aggregate size. They are stored over a large distributed cloud of machines, with computing co-located with the data. Given this data layout, even simple tasks are difficult to perform...
Conference Paper
Full-text available
As numerical ocean simulations become more realistic, analysis of their output is increasingly time consuming. As part of a larger effort to make ocean model output more accessible for analysis, we have developed a fast particle-tracking algorithm for exploring the kinematics of the simulated flow. The algorithm is independent of operating system a...
Article
The Wide Field Infrared Survey Telescope (WFIRST) is a 2.4 m telescope with a large field of view (~0.3 deg²) and fine angular resolution (0.11″). WFIRST’s Wide Field Instrument (WFI) will obtain images in the Z, Y, J, H, F184, W149 (wide) filter bands, and grism spectra of the same large field of view. The data volume of the WFIRST Science Arc...
Article
We present estimates of cosmological parameters from the application of the Karhunen-Loève transform to the analysis of the 3D power spectrum of density fluctuations using Sloan Digital Sky Survey galaxy redshifts. We use Ω_m h and f_b = Ω_b/Ω_m to describe the shape of the power spectrum, σ^L_8g for the (linearly extrapolated) normalization, and...
Article
We present the methodology and data behind the photometric redshift database of the Sloan Digital Sky Survey Data Release 12 (SDSS DR12). We adopt a hybrid technique, empirically estimating the redshift via local regression on a spectroscopic training set, then fitting a spectrum template to obtain K-corrections and absolute magnitudes. The SDSS sp...
Preprint
We present the methodology and data behind the photometric redshift database of the Sloan Digital Sky Survey Data Release 12 (SDSS DR12). We adopt a hybrid technique, empirically estimating the redshift via local regression on a spectroscopic training set, then fitting a spectrum template to obtain K-corrections and absolute magnitudes. The SDSS sp...
Article
We identify and correct numerous errors within the photometric and spectroscopic footprints (SFs) of the Sloan Digital Sky Survey's (SDSS) 6th data release (DR6). Within the SDSS's boundaries hundreds of millions of objects have been detected. Yet we present evidence that the boundaries themselves contain a significant number of mistakes that are b...
Article
Full-text available
We study the parity-odd part (that we shall call Doppler term) of the linear galaxy two-point correlation function that arises from wide-angle, velocity, Doppler lensing and cosmic acceleration effects. As it is important at low redshift and at large angular separations, the Doppler term is usually neglected in the current generation of galaxy surv...
Preprint
We study the parity-odd part (that we shall call Doppler term) of the linear galaxy two-point correlation function that arises from wide-angle, velocity, Doppler lensing and cosmic acceleration effects. As it is important at low redshift and at large angular separations, the Doppler term is usually neglected in the current generation of galaxy surv...
Article
Full-text available
Many eigensolvers such as ARPACK and Anasazi have been developed to compute eigenvalues of a large sparse matrix. These eigensolvers are limited by the capacity of RAM. They run in memory of a single machine for smaller eigenvalue problems and require the distributed memory for larger problems. In contrast, we develop an SSD-based eigensolver frame...
Article
The output from a direct numerical simulation (DNS) of turbulent channel flow at Re_τ ≈ 1000 is used to construct a publicly accessible spatio-temporal database for this flow, available through Web services. The simulated channel has a size of 8πh × 2h × 3πh, where h is the channel half-height. Data are stored at 2048 × 512 × 1536 spatial grid points for a total...
Article
We simulate the tidal disruption of a collisionless N-body globular star cluster in a total of 300 different orbits selected to have galactocentric radii between 10 and 30 kpc in four dark matter halos: (a) a spherical halo with no subhalos, (b) a spherical halo with subhalos, (c) a realistic halo with no subhalos, and (d) a realistic halo with sub...
Preprint
We simulate the tidal disruption of a collisionless N-body globular star cluster in a total of 300 different orbits selected to have galactocentric radii between 10 and 30 kpc in four dark matter halos: (a) a spherical halo with no subhalos, (b) a spherical halo with subhalos, (c) a realistic halo with no subhalos, and (d) a realistic halo with sub...
Article
The Lagrangian dynamics of a single fluid element within a self-gravitational matter field is intrinsically non-local due to the presence of the tidal force. This complicates the theoretical investigation of the non-linear evolution of various cosmic objects, e.g. dark matter halos, in the context of Lagrangian fluid dynamics, since a fluid parcel...
Article
We analyse the correlations between continuum properties and emission line equivalent widths of star-forming and active galaxies from the Sloan Digital Sky Survey. Since upcoming large sky surveys will make broad-band observations only, including strong emission lines into theoretical modelling of spectra will be essential to estimate physical prop...
Preprint
We analyse the correlations between continuum properties and emission line equivalent widths of star-forming and active galaxies from the Sloan Digital Sky Survey. Since upcoming large sky surveys will make broad-band observations only, including strong emission lines into theoretical modelling of spectra will be essential to estimate physical prop...
Article
We show that redshift-space distortions of galaxy correlations have a strong effect on correlation functions with the signature of the Baryon Acoustic Oscillations (BAO). Near the line of sight, the features become sharper as a result of redshift-space distortions. We analyze the SDSS DR7 main-galaxy sample (MGS), splitting the sample into slices 2...
Conference Paper
Cosmological N-body simulations are essential for studies of the large-scale distribution of matter and galaxies in the Universe. This analysis often involves finding clusters of particles and retrieving their properties. Detecting such "halos" among a very large set of particles is a computationally intensive problem, usually executed on the same...
Article
Accurate modelling of non-linearities in the galaxy bispectrum, the Fourier transform of the galaxy three-point correlation function, is essential to fully exploit it as a cosmological probe. In this paper, we present numerical and theoretical challenges in modelling the non-linear bispectrum. First, we test the robustness of the matter bispectrum...
Article
Full-text available
Solid state disks (SSDs) have advanced to outperform traditional hard drives significantly in both random reads and writes. However, heavy random writes trigger frequent garbage collection and decrease the performance of SSDs. In an SSD array, garbage collection of individual SSDs is not synchronized, leading to underutilization of some of the...
Article
Full-text available
When computing alignments of DNA sequences to a large genome, a key element in achieving high processing throughput is to prioritize locations in the genome where high-scoring mappings might be expected. We formulated this task as a series of list-processing operations that can be efficiently performed on graphics processing unit (GPU) hardware. We...
Conference Paper
Full-text available
Graph analysis performs many random reads and writes, so these workloads are typically performed in memory. Traditionally, analyzing large graphs requires a cluster of machines so the aggregate memory exceeds the graph size. We demonstrate that a multicore server can process graphs with billions of vertices and hundreds of billions of edges, uti...
Article
High Performance Computing is becoming an instrument in its own right. The largest simulations performed on our supercomputers are now approaching petabytes. As the volume of these simulations grows, it is becoming harder to access, analyze and visualize these data. At the same time, for broad community buy-in, we need to provide public access...
Article
Full-text available
Computerized decision making is becoming a reality with exponentially growing data and machine capabilities. Some decision making is extremely complex, historically reserved for governing bodies or market places where the collective human experience and intelligence come to play. Other decision making can be trusted to computers that are on a path...