Article

The Design and Implementation of FFTW3


Abstract

FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with hand-optimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for real-data DFTs of prime size, a new way of implementing DFTs by means of machine-specific single-instruction, multiple-data (SIMD) instructions, and how a special-purpose compiler can derive optimized implementations of the discrete cosine and sine transforms automatically from a DFT algorithm.
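The idea of adapting to the hardware can be illustrated with a toy "planner" that times several candidate algorithms on the current machine and keeps the fastest. This is only a minimal sketch under stated assumptions: the names `naive_dft`, `radix2_fft`, and `make_plan` are hypothetical, the candidates are textbook routines, and nothing here reproduces FFTW's actual planner or API.

```python
import cmath
import time


def naive_dft(x):
    """Direct O(n^2) evaluation of the DFT definition."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def radix2_fft(x):
    """Recursive Cooley-Tukey FFT; requires len(x) to be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = radix2_fft(x[0::2])
    odd = radix2_fft(x[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def make_plan(n, repeats=3):
    """Time every applicable candidate on dummy data of length n >= 1
    and return (name, function) for the fastest one on this machine."""
    candidates = {"naive_dft": naive_dft}
    if n & (n - 1) == 0:  # power of two, so the radix-2 routine applies
        candidates["radix2_fft"] = radix2_fft
    dummy = [complex(i % 7, 0.0) for i in range(n)]
    best_name, best_time = None, float("inf")
    for name, func in candidates.items():
        start = time.perf_counter()
        for _ in range(repeats):
            func(dummy)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name, candidates[best_name]
```

A caller would build the plan once for a given size, for example `name, transform = make_plan(1024)`, and then reuse `transform` for every input of that size. The choice is made by measurement rather than by a fixed rule, which is the point of the sketch.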


... The first, lofar_udp_extractor, offers an easy interface for accessing the normal functionality of the library for observers, while the second, lofar_stokes_extractor, utilises the output voltages to perform additional processing for science-ready outputs, such as channelisation of Stokes parameters through the FFTW library (Frigo and Johnson 2005) and additional temporal downsampling beyond the library's factor of 16 limit. Both of these interfaces also serve as examples of how to further utilise the library and its outputs. ...
Preprint
International LOFAR stations are powerful radio telescopes; however, they are delivered without the tooling necessary to convert their raw data stream into standard data formats that can be used by common processing pipelines, or science-ready data products. udpPacketManager is a C and C++ library that was developed with the intent of providing a faster-than-realtime software package for converting raw data into arbitrary data formats based on the needs of observers working with the Irish LOFAR station (I-LOFAR), and stations across Europe. It currently offers an open-source solution for both offline and online (pre-)processing of telescope data into a wide variety of formats.
... where $t_{\text{total}}$ is the length of the potential time series, and $\delta V(f)$ is the Fourier transform of the electric potential time series. The Fourier transform is computed using the C subroutine library FFTW [42]. At a given temperature, the time series $V_i(t)$ is split into ten segments of equal length. ...
Preprint
Charge noise in silicon quantum dots has been observed to have a 1/f spectrum. We propose a model in which a pair of quantum dots are coupled to a 2D bath of fluctuating two level systems (TLS) that have electric dipole moments and that interact with each other, i.e., with the other fluctuators. These interactions are primarily via the elastic strain field. We use a 2D nearest-neighbor Ising spin glass to represent these elastic interactions and to simulate the dynamics of the bath of electric dipole fluctuators in the presence of a ground plane representing metal gates above the oxide layer containing the fluctuators. The interactions between the TLS cause the energy splitting of individual fluctuators to change with time. We calculate the resulting fluctuations in the electric potential at the two quantum dots that lie below the oxide layer. We find that 1/f electric potential noise spectra at the quantum dots and cross correlation in the noise between the two quantum dots are in qualitative agreement with experiment. Our simulations find that the cross correlations decrease exponentially with increasing quantum dot separation.
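The excerpt for this entry describes splitting a time series into ten equal segments before Fourier transforming it. A minimal sketch of such a segment-averaged power spectrum follows. It assumes a plain periodogram normalisation (|DFT|² divided by the segment length) and uses a naive pure-Python DFT as a stand-in for an FFT library; the function names are hypothetical and the cited work's exact normalisation is not given in the excerpt.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, used here only to keep the sketch dependency-free."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def averaged_power_spectrum(series, num_segments=10):
    """Split `series` into equal-length segments, compute |DFT|^2 / length
    for each segment, and return the average over segments."""
    seg_len = len(series) // num_segments
    if seg_len == 0:
        raise ValueError("series is too short for the requested segments")
    spectrum = [0.0] * seg_len
    for s in range(num_segments):
        segment = series[s * seg_len:(s + 1) * seg_len]
        for k, coeff in enumerate(dft(segment)):
            spectrum[k] += abs(coeff) ** 2 / seg_len
    # Samples left over after the last full segment are ignored.
    return [p / num_segments for p in spectrum]
```

Averaging over segments trades frequency resolution for a lower-variance estimate, which is the usual motivation for this kind of splitting.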
... Periodic boundary conditions are imposed so that the convolutions involved in the DFT mean field ℋ[ρ] can be carried out using the fast Fourier transform [43]. The differential operators in ℋ[ρ] are approximated by 13-point formulas. All the simulation grids had 288 points equally spaced by 0.4 Å in each direction, except for the head-on collision for which the grid had to be extended along the incidence axis (z) to 576 points. ...
Article
We address the collision of two superfluid ⁴He droplets at non-zero initial relative velocities and impact parameters within the framework of liquid ⁴He time-dependent density functional theory at zero temperature. Despite the small size of these droplets (1000 He atoms in the merged droplet) imposed by computational limitations, we have found that quantized vortices may be readily nucleated for reasonable collision parameters. At variance with head-on collisions, where only vortex rings are produced, collisions with a non-zero impact parameter produce linear vortices that are nucleated at indentations appearing on the surface of the deformed merged droplet. Whereas for equal-size droplets, vortices are produced in pairs, an odd number of vortices can appear when the colliding droplet sizes are different. In all cases, vortices coexist with surface capillary waves. The possibility for collisions to be at the origin of vortex nucleation in experiments involving very large droplets is discussed. An additional surprising result is the observation of droplet coalescence even for grazing and distal collisions at relative velocities as high as 80 and 40 m/s, respectively, induced by the long-range van der Waals attraction between the droplets.
... 4-6 in [8]) depending on whether the number N can be represented as a product of non-negative integer powers of the first prime numbers. Specifically, this is a quote [5] from the documentation of FFTW [4], which is the de facto standard in pseudospectral [3] methods: "The input data can have arbitrary length. FFTW employs O(n log n) algorithms for all lengths, including prime numbers." ...
Preprint
The problem of optimizing the array size for modern discrete Fourier transform libraries is considered and reformulated as an integer linear programming problem. A speed-up in finding an optimal solution with a standard, freely available library, relative to the brute-force approach, is demonstrated. An ad hoc recursive algorithm for finding the optimal solution is proposed, and the complexity scaling of the algorithm is estimated analytically. The problem can be used in a linear programming class as an example of a purely integer programming problem (the continuous linear programming solution makes no sense) that is simple enough to be solved even in interpreted programming languages such as Python or Matlab.
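A brute-force baseline for this kind of size selection can be sketched as follows. It assumes the goal is the smallest length not below a requested `n` whose only prime factors come from a small set (2, 3, 5, 7 here, chosen for illustration); the function name `next_smooth_size` is hypothetical, and the abstract's actual objective and algorithm may differ.

```python
def next_smooth_size(n, primes=(2, 3, 5, 7)):
    """Return the smallest integer >= n whose prime factors all lie in `primes`."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    candidate = n
    while True:
        rest = candidate
        # Strip every allowed prime factor; a remainder of 1 means
        # the candidate is a product of powers of the allowed primes.
        for p in primes:
            while rest % p == 0:
                rest //= p
        if rest == 1:
            return candidate
        candidate += 1
```

For example, `next_smooth_size(1021)` returns 1024 and `next_smooth_size(97)` returns 98. Taking logarithms turns the product of prime powers into a linear expression in the exponents, which is one way to see why an integer linear programming formulation is natural.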
... S.D. thanks the Department of Applied Mathematics at the U. of Washington for hospitality. A.S. and S.D. acknowledge the FFTW project and its authors [71] as well as the entire GNU Project. A.S. thanks the Institute for Computational and Experimental Research in Mathematics, Providence, RI, being resident during the "Hamiltonian Methods in Dispersive and Wave Evolution Equations" program supported by NSF-DMS-1929284. ...
... Given the importance of DFT, FFT is a widely studied primitive, and there exist vendor-provided FFT libraries for CPUs [6, 7, 8, 12] and GPUs [10, 14, 30], as well as vendor-independent auto-tuning FFT frameworks such as Fastest Fourier Transform in the West (FFTW) [19, 20]. We believe that our proposed PIM FFT routines can be a good complement to these existing efficient FFT solutions. ...
Preprint
This paper evaluates the efficacy of recent commercial processing-in-memory (PIM) solutions to accelerate fast Fourier transform (FFT), an important primitive across several domains. Specifically, we observe that efficient implementations of FFT on modern GPUs are memory bandwidth bound. As such, the memory bandwidth boost availed by commercial PIM solutions makes a case for PIM to accelerate FFT. To this end, we first deduce a mapping of FFT computation to a strawman PIM architecture representative of recent commercial designs. We observe that even with careful data mapping, PIM is not effective in accelerating FFT. To address this, we make a case for collaborative acceleration of FFT with PIM and GPU. Further, we propose software and hardware innovations which lower PIM operations necessary for a given FFT. Overall, our optimized PIM FFT mapping, termed Pimacolaba, delivers performance and data movement savings of up to 1.38$\times$ and 2.76$\times$, respectively, over a range of FFT sizes.
... Among Python simulators, [32,8] can be cited among the most general models, along with more application-specific GitHub packages and the Python wrappers of the above-mentioned models. Finally, in [30] the authors use the C language and the FFTW3 library [12]. ...
... In our implementation, we utilize the Fast Fourier Transform (FFT) algorithm to compute the discrete FT. Specifically, we employ the Cooley-Tukey algorithm [8,16], ...
Preprint
By integrating the self-attention capability and the biological properties of Spiking Neural Networks (SNNs), Spikformer applies the flourishing Transformer architecture to SNN design. It introduces a Spiking Self-Attention (SSA) module to mix sparse visual features using spike-form Query, Key, and Value, resulting in State-Of-The-Art (SOTA) performance on numerous datasets compared to previous SNN-like frameworks. In this paper, we demonstrate that the Spikformer architecture can be accelerated by replacing the SSA with an unparameterized Linear Transform (LT) such as Fourier and Wavelet transforms. These transforms are utilized to mix spike sequences, reducing the quadratic time complexity to log-linear time complexity. They alternate between the frequency and time domains to extract sparse visual features, showcasing powerful performance and efficiency. We conduct extensive experiments on image classification using both neuromorphic and static datasets. The results indicate that compared to the SOTA Spikformer with SSA, Spikformer with LT achieves higher Top-1 accuracy on neuromorphic datasets and comparable Top-1 accuracy on static datasets. Moreover, Spikformer with LT achieves approximately $29$-$51\%$ improvement in training speed, $61$-$70\%$ improvement in inference speed, and reduces memory usage by $4$-$26\%$ due to not requiring learnable parameters.
... (2.22), is evaluated using a pseudo-spectral method on each perpendicular plane. More precisely, the spatial derivatives, contained in the Poisson bracket, are obtained in the real space using a backward fast Fourier transform (Frigo & Johnson 2005). The fields are then multiplied in real space and the result transformed back to Fourier space using a forward fast Fourier transform, including a 2/3 anti-aliasing filter (Orszag 1971). ...
Preprint
This study presents a comprehensive benchmark and convergence analysis of the gyromoment (GM) approach in the gyrokinetic local flux-tube limit, focusing on the cyclone base case (CBC) and the Dimits shift. The GM approach demonstrates its efficacy in accurately capturing the nonlinear dynamics of the CBC with fewer velocity space points compared to the GENE code. Increasing velocity dissipation enhances convergence, albeit with a slight discrepancy in the saturated heat flux value. The GM approach successfully reproduces the Dimits shift and effectively captures its width compared to the ITG threshold. In the collisional case, we obtain a good agreement with previous global PIC results on transport. We report that the choice of collision model has a minimal impact both on the ITG growth rate and on the nonlinear saturated heat flux. We attribute this to the adiabatic electron model, which precludes electron-ion collisions.
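The pseudo-spectral step quoted in the excerpt for this entry (backward transform to real space, pointwise multiplication, forward transform with a 2/3 anti-aliasing filter) can be sketched in one dimension as below. This is an illustrative reduction, not the cited code: the function names are hypothetical, a naive DFT stands in for an FFT library, and the filter is taken to keep only modes with |k| < n/3.

```python
import cmath


def dft(x, sign=-1):
    """Naive DFT; sign=-1 is the forward transform, sign=+1 the unnormalised backward one."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def idft(coeffs):
    """Backward transform (Fourier space to real space), normalised by 1/n."""
    n = len(coeffs)
    return [v / n for v in dft(coeffs, sign=+1)]


def signed_wavenumber(k, n):
    """Map DFT index k in [0, n) to its signed wavenumber."""
    return k if k <= n // 2 else k - n


def dealiased_product(f_hat, g_hat):
    """Return Fourier coefficients of f*g, computed in real space,
    with the 2/3 rule applied before and after the multiplication."""
    n = len(f_hat)
    keep = [abs(signed_wavenumber(k, n)) < n / 3.0 for k in range(n)]
    f = idft([c if m else 0j for c, m in zip(f_hat, keep)])
    g = idft([c if m else 0j for c, m in zip(g_hat, keep)])
    prod_hat = dft([a * b for a, b in zip(f, g)])
    return [c if m else 0j for c, m in zip(prod_hat, keep)]
```

Multiplying two fields that contain modes up to K produces modes up to 2K, which wrap around on an n-point grid. Keeping only |k| < n/3 ensures that the wrapped modes fall in the discarded band, so the retained coefficients are free of aliasing.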
... jasonkaye/libdlr). The FFTW library [53] was used for FFTs. ...
Article
We consider the numerical solution of the real-time equilibrium Dyson equation, which is used in calculations of the dynamical properties of quantum many-body systems. We show that this equation can be written as a system of coupled, nonlinear, convolutional Volterra integro-differential equations, for which the kernel depends self-consistently on the solution. As is typical in the numerical solution of Volterra-type equations, the computational bottleneck is the quadratic-scaling cost of history integration. However, the structure of the nonlinear Volterra integral operator precludes the use of standard fast algorithms. We propose a quasilinear-scaling FFT-based algorithm which respects the structure of the nonlinear integral operator. The resulting method can reach large propagation times and is thus well-suited to explore quantum many-body phenomena at low energy scales. We demonstrate the solver with two standard model systems: the Bethe graph and the Sachdev-Ye-Kitaev model.
... For the line analysis, the efficient formation of Eq. 3, PROTON leverages the state-of-the-art FFTW library [11] to perform the DCT-II required for the analysis. The library is widely recognized as one of the fastest and most optimized solutions for performing fast Fourier transforms (FFTs), ensuring that PROTON can perform the analysis quickly and effectively. ...
... The EEG records were visually inspected on two-second epochs for removing all epochs containing artifacts from further analysis. The FFTW (Fastest Fourier Transform in the West) package (Frigo and Johnson, 2005) was applied to compute power spectra densities for the artifact-free epochs (see www.fftw.org for more detail). ...
Chapter
Diurnal variation in vigilance states and substates is associated with changes in electroencephalographic waves. These waves can serve as the indicators of the processes regulating rhythmicity of sleep and wakefulness states, alertness and sleepiness substates, performance and attention levels, etc. In particular, one of such indicators, alpha waves in eyes closed condition, can be implemented into quantitative description of the process of gradual weakening of drive for wake in the course of extended wakefulness. Feasibility of such description is illustrated by using the measurements of alpha waves for tracing changes in objective sleepiness of 130 and 48 participants of one- and two-day sleep deprivation experiments, respectively.
... with respect to the mean number of particles per cell n. We Fourier transform the density fields using FFTW3 (Frigo & Johnson 2005) and deconvolve the CIC window function, yielding the density modes $\delta_m(k)$. We then compute the density power spectrum $P_{mm}(k)$, given by ...
Article
We build a field-level emulator for cosmic structure formation that is accurate in the nonlinear regime. Our emulator consists of two convolutional neural networks trained to output the nonlinear displacements and velocities of N-body simulation particles based on their linear inputs. Cosmology dependence is encoded in the form of style parameters at each layer of the neural network, enabling the emulator to effectively interpolate the outcomes of structure formation between different flat Lambda cold dark matter cosmologies over a wide range of background matter densities. The neural network architecture makes the model differentiable by construction, providing a powerful tool for fast field-level inference. We test the accuracy of our method by considering several summary statistics, including the density power spectrum with and without redshift space distortions, the displacement power spectrum, the momentum power spectrum, the density bispectrum, halo abundances, and halo profiles with and without redshift space distortions. We compare these statistics from our emulator with the full N-body results, the COmoving Lagrangian Acceleration (COLA) method, and a fiducial neural network with no cosmological dependence. We find that our emulator gives accurate results down to scales of k ∼ 1 Mpc⁻¹ h, representing a considerable improvement over both COLA and the fiducial neural network. We also demonstrate that our emulator generalizes well to initial conditions containing primordial non-Gaussianity without the need for any additional style parameters or retraining.
... We use FFTW.jl [12] via the interface provided by SummationByPartsOperators.jl [34] for Fourier collocation methods and Trixi.jl [40,43] for discontinuous Galerkin discretizations of conservation laws. ...
Preprint
A posteriori error estimates based on residuals can be used for reliable error control of numerical methods. Here, we consider them in the context of ordinary differential equations and Runge-Kutta methods. In particular, we take the approach of Dedner & Giesselmann (2016) and investigate it when used to select the time step size. We focus on step size control stability when combined with explicit Runge-Kutta methods and demonstrate that a standard I controller is unstable while more advanced PI and PID controllers can be designed to be stable. We compare the stability properties of residual-based estimators and classical error estimators based on an embedded Runge-Kutta method both analytically and in numerical experiments.
Article
Subsonic turbulence plays a major role in determining properties of the intracluster medium (ICM). We introduce a new Meshless Finite Mass (MFM) implementation in OpenGadget3 and apply it to this specific problem. To this end, we present a set of test cases to validate our implementation of the MFM framework in our code. These include but are not limited to: the soundwave and Kepler disk as smooth situations to probe the stability, a Rayleigh-Taylor and Kelvin-Helmholtz instability as popular mixing instabilities, a blob test as more complex example including both mixing and shocks, shock tubes with various Mach numbers, a Sedov blast wave, different tests including self-gravity such as gravitational freefall, a hydrostatic sphere, the Zeldovich-pancake, and a $10^{15}\,M_\odot$ galaxy cluster as cosmological application. Advantages over SPH include increased mixing and a better convergence behavior. We demonstrate that the MFM-solver is robust, also in a cosmological context. We show evidence that the solver performs extraordinarily well when applied to decaying subsonic turbulence, a problem very difficult to handle for many methods. MFM captures the expected velocity power spectrum with high accuracy and shows a good convergence behavior. Using MFM or SPH within OpenGadget3 leads to a comparable decay in turbulent energy due to numerical dissipation. When studying the energy decay for different initial turbulent energy fractions, we find that MFM performs well down to Mach numbers $\mathcal{M} \approx 0.01$. Finally, we show how important the slope limiter and the energy-entropy switch are to control the behavior and the evolution of the fluids.
Article
The application of multidimensional optical sensing technologies, such as the spectral light field (SLF) imager, has become increasingly common in recent years. The SLF sensors provide information in the form of one-dimensional spectral data, two-dimensional spatial data, and two-dimensional angular measurements. Spatial-spectral and angular data are essential in a variety of fields, from computer vision to microscopy. Beam-splitters or expensive camera arrays are required for the usage of SLF sensors. The paper describes a low-cost RGB light field camera-based compressed snapshot SLF imaging method. Inspired by the compressive sensing paradigm, the four-dimensional SLF can be reconstructed from a measurement of an RGB light field camera via a proposed network that combines a U-shaped neural network with multi-head self-attention and unparameterized Fourier transform modules. This method is capable of gathering images with a spectral resolution of 10 nm, angular resolution of 9 × 9, and spatial resolution of 622 × 432 within the spectral range of 400 to 700 nm. It offers an alternative approach to low-cost SLF imaging.
Article
We construct and dynamically evolve dipolar self-interacting scalar boson stars in a model with sextic (+ quartic) self-interactions. The domain of existence of such dipolar Q-stars has a similar structure to that of the fundamental monopolar stars of the same model. For the latter it is structured in a Newtonian plus a relativistic branch, wherein perturbatively stable solutions exist, connected by a middle unstable branch. Our evolutions support similar dynamical properties of the dipolar Q-stars that: 1) in the Newtonian and relativistic branches are dynamically robust over time scales longer than those for which dipolar stars without self-interactions are seen to decay; 2) in the middle branch migrate to either the Newtonian or the relativistic branch; 3) beyond the relativistic branch decay to black holes. Overall, these results strengthen the observation, seen in other contexts, that self-interactions can mitigate dynamical instabilities of scalar boson star models.
Article
Numerically “exact” methods addressing the dynamics of coupled electron–phonon systems have been intensively developed. Nevertheless, the corresponding results for the electron mobility μdc are scarce, even for the one-dimensional (1d) Holstein model. Building on our recent progress on single-particle properties, here we develop the momentum-space hierarchical equations of motion (HEOM) method to evaluate real-time two-particle correlation functions of the 1d Holstein model at a finite temperature. We compute numerically “exact” dynamics of the current–current correlation function up to real times sufficiently long to capture the electron’s diffusive motion and provide reliable results for μdc in a wide range of model parameters. In contrast to the smooth ballistic-to-diffusive crossover in the weak-coupling regime, we observe a temporally limited slow-down of the electron on intermediate time scales already in the intermediate-coupling regime, which translates to a finite-frequency peak in the optical response. Our momentum-space formulation lowers the numerical effort with respect to existing HEOM-method implementations, while we remove the numerical instabilities inherent to the undamped-mode HEOM by devising an appropriate hierarchy closing scheme. Still, our HEOM remains unstable at too low temperatures, for too strong electron–phonon coupling, and for too fast phonons.
Article
In this paper, inhomogeneous chemical kinetics are simulated by describing the concentrations of interacting chemical species by a linear expansion of basis functions in such a manner that the coupled reaction and diffusion processes are propagated through time efficiently by tailor-made numerical methods. The approach is illustrated through modelling α- and γ-radiolysis in thin layers of water and at their solid interfaces from the start of the chemical phase until equilibrium was established. The method’s efficiency is such that hundreds of such systems can be modelled in a few hours using a single core of a typical laptop, allowing the investigation of the effects of the underlying parameter space. Illustrative calculations showing the effects of changing dose-rate and water-layer thickness are presented. Other simulations are presented which show the approach’s capability to solve problems with spherical symmetry (an approximation to an isolated radiolytic spur), where the hollowing out of an initial Gaussian distribution is observed, in line with previous calculations. These illustrative simulations show the generality and the computational efficiency of this approach to solving reaction-diffusion problems. Furthermore, these example simulations illustrate the method’s suitability for simulating solid-fluid interfaces, which have received a lot of experimental attention in contrast to the lack of computational studies.
Article
Within density functional theory, we have studied self-sustained, deformable, rotating liquid He cylinders subject to planar deformations. In the normal fluid He3 case, the kinetic energy has been incorporated in a semiclassical Thomas-Fermi approximation. In the He4 case, our approach takes into account its superfluid character. For this study, we have chosen to limit our investigation to vortex-free configurations where angular momentum is exclusively stored in capillary waves on a deformed cross-section cylinder. Only planar deformations leading to noncircular cross sections have been considered, as they aim to represent the cross section of the very large deformed He drops discussed in the experiments. Axisymmetric Rayleigh instabilities, always present in fluid columns, have been set aside. The calculations allow us to carry out a comparison between the rotational behavior of a normal, rotational fluid (He3) and a superfluid, irrotational fluid (He4).
Chapter
In situ visualization and analysis is a valuable yet underutilized commodity for the simulation community. There is hesitance or even resistance to adopting new methodologies due to the uncertainties that in situ holds for new users. There is a perceived implementation cost, maintenance cost, risk to simulation fault tolerance, potential lack of scalability, a new resource cost for running in situ processes, and more. The list of reasons why in situ is overlooked is long. We are attempting to break down this barrier by introducing Inshimtu. Inshimtu is an in situ “shim” library that enables users to try in situ before they buy into a full implementation. It does this by working with existing simulation output files, requiring no changes to simulation code. The core visualization component of Inshimtu is ParaView Catalyst, allowing it to take advantage of both interactive and non-interactive visualization pipelines that scale. We envision Inshimtu as a stepping stone to show users the value of in situ and motivate them to move to one of the many existing fully-featured in situ libraries available in the community. We demonstrate the functionality of Inshimtu with a scientific workflow on the Shaheen II supercomputer. Inshimtu is available for download at: https://github.com/kaust-vislab/Inshimtu-basic.
Chapter
The emergence of RISC-V as a reduced instruction set architecture has brought several advantages such as openness, flexibility, scalability, and efficiency compared to other commercial ISAs. It has gained significant popularity, especially in the field of high-performance computing. However, there is a lack of high-performance implementations of numerical algorithms, including the Fast Fourier Transform (FFT) algorithm. To address this issue, the paper focuses on optimizing the butterfly network, butterfly kernel, and single instruction multiple data (SIMD) operations to achieve efficient calculations for FFT with a computation scale of $2^n$ on RISC-V CPUs. The experimental results demonstrate a significant improvement in the performance of the FFT algorithm library implemented using the proposed optimizations compared to existing implementations like FFTW on RISC-V CPUs.
Keywords: FFT, RISC-V, SIMD, Cooley-Tukey
Article
We propose a novel graphics processing unit (GPU) algorithm that can handle a large‐scale 3D fast Fourier transform (i.e., 3D‐FFT) problem whose data size is larger than the GPU's memory. A 1D FFT‐based 3D‐FFT computational approach is used to solve the limited device memory issue. Moreover, to reduce the communication overhead between the CPU and GPU, we propose a 3D data‐transposition method that converts the target 1D vector into a contiguous memory layout and improves data transfer efficiency. The transposed data are communicated between the host and device memories efficiently through the pinned buffer and multiple streams. We apply our method to various large‐scale benchmarks and compare its performance with the state‐of‐the‐art multicore CPU FFT library (i.e., fastest Fourier transform in the West [FFTW]) and a prior GPU‐based 3D‐FFT algorithm. Our method achieves a higher performance (up to 2.89 times) than FFTW, and the performance gap widens as the data size increases. The performance of the prior GPU algorithm decreases considerably in massive‐scale problems, whereas our method's performance is stable.
Article
One key aspect of coarsening following a quench below the critical temperature is domain growth. For the non-conserved Ising model a power-law growth of domains of like spins with exponent α=1/2 is predicted. Including recent work, it was not possible to clearly observe this growth law in the special case of a zero-temperature quench in the three-dimensional model. Instead a slower growth with α<1/2 was reported. We attempt to clarify this discrepancy by running large-scale Monte Carlo simulations on simple-cubic lattices with linear lattice sizes up to L=2048 employing an efficient GPU implementation. Indeed, at late times we measure domain sizes compatible with the expected growth law—but surprisingly, at still later times domains even grow superdiffusively, i.e., with α>1/2. We argue that this new problem is possibly caused by sponge-like structures emerging at early times.
Article
Scattering networks on Euclidean domains are capable of analytically realizing signal representation invariant to transformations such as translation, rotation and scaling with wavelets. However, existing scattering networks defined on the sphere and Riemannian manifolds only consider axisymmetric wavelets and are restricted in representation by the isotropic filter structures. In this paper, we propose a novel anisotropic spherical scattering network to achieve multi-scale directional representation for spherical signals. The scattering transform is realized by cascading directional spin wavelets and modulus operators to propagate localized signal components of varying scales and directions in a recursive manner without parameter tuning. Furthermore, a combined spherical scattering network is presented to guarantee the invariance to arbitrary rotations about the z-axis by incorporating the scattering coefficients along the dimension of rotation angle. To our best knowledge, the proposed network is the first to achieve multi-scale anisotropic filtering via the scattering transform on the sphere. We demonstrate in theory that the proposed network is energy preserving, invariant to azimuthal rotations and stable to diffeomorphisms. Extensive experimental results on benchmark datasets show that the proposed network achieves state-of-the-art performance in spherical signal analysis on various 2-D and 3-D datasets mapped to the sphere.
Preprint
We show how to learn discrete field theories from observational data of fields on a space-time lattice. For this, we train a neural network model of a discrete Lagrangian density such that the discrete Euler--Lagrange equations are consistent with the given training data. We, thus, obtain a structure-preserving machine learning architecture. Lagrangian densities are not uniquely defined by the solutions of a field theory. We introduce a technique to derive regularisers for the training process which optimise numerical regularity of the discrete field theory. Minimisation of the regularisers guarantees that close to the training data the discrete field theory behaves robustly and efficiently when used in numerical simulations. Further, we show how to identify structurally simple solutions of the underlying continuous field theory such as travelling waves. This is possible even when travelling waves are not present in the training data. This is compared to data-driven model order reduction based approaches, which struggle to identify suitable latent spaces containing structurally simple solutions when these are not present in the training data. Ideas are demonstrated on examples based on the wave equation and the Schrödinger equation.
Article
The performance of machine learning algorithms, when used for segmenting 3D biomedical images, does not reach the level expected based on results achieved with 2D photos. This may be explained by the comparative lack of high-volume, high-quality training datasets, which require state-of-the-art imaging facilities, domain experts for annotation and large computational and personal resources. The HR-Kidney dataset presented in this work bridges this gap by providing 1.7 TB of artefact-corrected synchrotron radiation-based X-ray phase-contrast microtomography images of whole mouse kidneys and validated segmentations of 33 729 glomeruli, which corresponds to an increase of one to two orders of magnitude over currently available biomedical datasets. The image sets also contain the underlying raw data, threshold- and morphology-based semi-automatic segmentations of renal vasculature and uriniferous tubules, as well as true 3D manual annotations. We therewith provide a broad basis for the scientific community to build upon and expand in the fields of image processing, data augmentation and machine learning, in particular unsupervised and semi-supervised learning investigations, as well as transfer learning and generative adversarial networks.
Article
The instability of Stokes waves, steady propagating waves on the surface of an ideal fluid of infinite depth, is a fundamental problem in the field of nonlinear science. The dominant instability of these waves depends on their steepness. For small amplitude waves, it is well known that the Benjamin-Feir or modulational instability dominates the dynamics of a wave train. We demonstrate that for steeper waves, an instability caused by disturbances localized at the wave crest vastly surpasses the growth rate of the modulational instability. These dominant localized disturbances are either coperiodic with the Stokes wave or have twice its period. In either case, the nonlinear evolution of the instability leads to the formation of plunging breakers. This phenomenon explains why long propagating ocean swell consists of small-amplitude waves.
Article
Curved-spacetime geometric-optics maps derived from a deep photometric survey should contain information about the three-dimensional matter distribution and thus about cosmic voids in the survey, despite projection effects. We explore to what degree sky-plane geometric-optics maps can reveal the presence of intrinsic three-dimensional voids. We carry out a cosmological N-body simulation and place it further than a gigaparsec from the observer, at redshift 0.5. We infer three-dimensional void structures using the watershed algorithm. Independently, we calculate a surface overdensity map and maps of weak gravitational lensing and geometric-optics scalars. We propose and implement a heuristic algorithm for detecting (projected) radial void profiles from these maps. We find in our simulation that given the sky-plane centres of the three-dimensional watershed-detected voids, there is significant evidence of finding corresponding void centres in the surface overdensity Σ, the averaged weak-lensing tangential shear $\overline{{\gamma }_{\perp }}$, the Sachs expansion θ, and the Sachs shear modulus |σ|. Recovering the centres of the three-dimensional voids from the sky-plane information alone is significant given the Sachs expansion θ, or the Sachs shear |σ|, mildly significant given the weak-lensing shear $\overline{{\gamma }_{\perp }}$, and not significant for the surface overdensity Σ. Void radii are uncorrelated between three-dimensional and two-dimensional voids; our algorithm is not designed to distinguish voids that are nearly concentric in projection. This investigation shows preliminary evidence encouraging observational studies of gravitational lensing through individual voids, either blind or with spectroscopic/photometric redshifts. The former case – blind searches – should generate falsifiable predictions of intrinsic three-dimensional void centres.
Chapter
The numerical modeling of optical signal propagation in fibers usually requires high-performance solvers for the nonlinear Schrödinger equation. Multicore CPUs and Graphical Processing Units (GPUs) are usually used for highly intensive parallel computations. We consider several implementations of solvers for the generalized multimode nonlinear Schrödinger equation. Reference MATLAB code (freely available at https://github.com/WiseLabAEP/GMMNLSE-Solver-FINAL) of the split-step Fourier method (SSFM) and the massively parallel algorithm (MPA) has been redesigned using the C++ OpenMP interface and C-oriented Compute Unified Device Architecture (CUDA) by Nvidia. Using this code, we explore several approaches for parallelization of computations. We show that, for small numbers of modes, the OpenMP implementation of the MPA is up to an order of magnitude faster than the GPU implementation owing to data transfer overheads, while the GPU implementation outperforms the CPU one starting from 10 modes. We also give several practical recommendations regarding the integration step size of the solvers.
Keywords: Nonlinear Schrödinger equation; Split step Fourier method; Massively parallel algorithm
Article
Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping algorithms, cache-oblivious divide-and-conquer trapezoidal algorithms, and Krylov subspace methods. In this paper, we present two efficient parallel algorithms for performing linear stencil computations. Current direct solvers in this domain are computationally inefficient, and Krylov methods require manual labor and mathematical training. We solve these problems for linear stencils by using DFT preconditioning on a Krylov method to achieve a direct solver which is both fast and general. Indeed, while all currently available algorithms for solving general linear stencils perform Θ(NT) work, where N is the size of the spatial grid and T is the number of timesteps, our algorithms perform o(NT) work. To the best of our knowledge, we give the first algorithms that use fast Fourier transforms to compute final grid data by evolving the initial data for many timesteps at once. Our algorithms handle both periodic and aperiodic boundary conditions, and achieve polynomially better performance bounds (i.e., computational complexity and parallel runtime) than all other existing solutions. Initial experimental results show that implementations of our algorithms that evolve grids of roughly 10⁷ cells for around 10⁵ timesteps run orders of magnitude faster than state-of-the-art implementations for periodic stencil problems, and 1.3× to 8.5× faster for aperiodic stencil problems. Code Repository: https://github.com/TEAlab/FFTStencils
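The idea of evolving a periodic linear stencil for many timesteps at once can be illustrated with a small sketch. This is not the algorithm of the paper above (which also covers aperiodic boundaries and uses DFT preconditioning on a Krylov method); it only shows, under the assumption of a one-dimensional periodic grid whose length is a power of two, that T applications of a stencil become a single element-wise power in Fourier space. The names `fft`, `step_direct` and `evolve_fft` are hypothetical helpers introduced for this illustration.

```python
import cmath

def fft(x, sign=-1):
    """Recursive radix-2 transform; len(x) must be a power of two.
    sign=-1 gives the forward DFT, sign=+1 the unnormalised inverse."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], sign), fft(x[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def step_direct(u, stencil):
    """One timestep: u_new[i] = sum over offsets d of c_d * u[(i + d) mod n]."""
    n = len(u)
    return [sum(c * u[(i + d) % n] for d, c in stencil.items()) for i in range(n)]

def evolve_fft(u, stencil, steps):
    """Apply the periodic stencil `steps` times using three transforms."""
    n = len(u)
    # One step is a circular convolution with this kernel.
    kernel = [0.0] * n
    for d, c in stencil.items():
        kernel[(-d) % n] += c
    symbol = fft(kernel)      # per-frequency multiplier of one step
    spectrum = fft(u)
    evolved = [s * g ** steps for s, g in zip(spectrum, symbol)]
    # Inverse transform; the input is real, so keep the real part.
    return [v.real / n for v in fft(evolved, sign=+1)]
```

For example, with `stencil = {-1: 0.25, 0: 0.5, 1: 0.25}` the result of `evolve_fft(u, stencil, 20)` should agree, up to rounding, with calling `step_direct` twenty times, while the cost of the transform-based version does not grow with the number of steps beyond the exponentiation.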
Chapter
The Fast Fourier Transform (FFT) algorithm that calculates the Discrete Fourier Transform (DFT) is one of the major breakthroughs in scientific computing and is now an indispensable tool in a vast number of fields. Unfortunately, software packages that provide fast computation of the DFT via the FFT differ vastly in functionality as well as uniformity. A widely accepted Applications Programmer Interface (API) for the DFT would advance the field of scientific computing significantly. In this paper, we formulate an API for DFT computation that encompasses all the functionality offered by a number of popular packages combined, allows easy porting from existing codes, and exhibits a systematic naming convention with relatively short calling sequences.
Article
We propose a new algorithm for fast Fourier transforms. This algorithm features uniformly long vector lengths and stride one data access. Thus it is well adapted to modern vector computers like the Fujitsu VP2200 having several floating point pipelines per CPU and very fast stride one data access. It also has favorable properties for distributed memory computers as all communication is gathered together in one step. The algorithm has been implemented on the Fujitsu VP2200 using the basic subroutines for fast Fourier transforms discussed elsewhere. We develop the theory of index digit permutations to some extent. With this theory we can derive the splitting formulas for almost all mixed-radix FFT algorithms known so far. This framework enables us to prove these algorithms but also to derive our new algorithm. The development and systematic use of this framework is new and allows us to simplify the proofs which are now reduced to the application of matrix recursions.
Article
Fast Fourier transform (FFT)-based computations can be far more accurate than the slow transforms suggest. Discrete Fourier transforms computed through the FFT are far more accurate than slow transforms, and convolutions computed via FFT are far more accurate than the direct results. However, these results depend critically on the accuracy of the FFT software employed, which should generally be considered suspect. Popular recursions for fast computation of the sine/cosine table (or twiddle factors) are inaccurate due to inherent instability. Some analyses of these recursions that have appeared heretofore in print, suggesting stability, are incorrect. Even in higher dimensions, the FFT is remarkably stable.
Article
An investigation into the history of the Fast Fourier Transform (FFT) algorithm is presented. It deals mostly with the work of Carl Friedrich Gauss, an eminent German mathematician who apparently described an algorithm similar to the FFT for the computation of the coefficients of a finite Fourier series. Historical references and evidence of Gauss' algorithm are presented. It is also shown that various FFT-type algorithms were used in Great Britain and elsewhere in the nineteenth century, but were unrelated to the work of Gauss and were, in fact, not as general or well-formulated as Gauss' work. Almost one hundred years passed between the publication of Gauss' algorithm and the modern rediscovery of this approach.
Conference Paper
Achieving peak performance in important numerical kernels such as dense matrix multiply or sparse-matrix vector multiplication usually requires extensive, machine-dependent tuning by hand. In response, a number of automatic tuning systems have been developed which typically operate by (1) generating multiple implementations of a kernel, and (2) empirically selecting an optimal implementation. One such system is FFTW (Fastest Fourier Transform in the West) for the discrete Fourier transform. In this paper, we review FFTW's inner workings with an emphasis on its code generator, and report on our empirical evaluation of the system on two different hardware and compiler platforms. We then describe a number of our own extensions to the FFTW code generator that compute efficient discrete cosine transforms and show promising speed-ups over a vendor-tuned library. We also comment on current opportunities to develop tuning systems in the spirit of FFTW for other widely-used kernels.
Conference Paper
Conventional algorithms for computing large one-dimensional fast Fourier transforms (FFTs), even those algorithms recently developed for vector and parallel computers, are largely unsuitable for systems with external or hierarchical memory. The principal reason for this is the fact that most FFT algorithms require at least m complete passes through the data set to compute a $2^m$-point FFT. This paper describes some advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) employ strictly unit stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation. Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main memory version outperforms the current Cray library FFT routines on the CRAY-2, the CRAY X-MP, and the CRAY Y-MP systems. Using all eight processors on the CRAY Y-MP, this main memory routine runs at nearly two gigaflops.
Article
SPIRAL is a generator for libraries of fast software implementations of linear signal processing transforms. These libraries are adapted to the computing platform and can be re-optimized as the hardware is upgraded or replaced. This paper describes the main components of SPIRAL: the mathematical framework that concisely describes signal transforms and their fast algorithms; the formula generator that captures at the algorithmic level the degrees of freedom in expressing a particular signal processing transform; the formula translator that encapsulates the compilation degrees of freedom when translating a specific algorithm into an actual code implementation; and, finally, an intelligent search engine that finds within the large space of alternative formulas and implementations the “best” match to the given computing platform. We present empirical data that demonstrate the high performance of SPIRAL generated code.
Article
Given an instruction set, the superoptimizer finds the shortest program to compute a function. Startling programs have been generated, many of them engaging in convoluted bit-fiddling bearing little resemblance to the source programs which defined the functions. The key idea in the superoptimizer is a probabilistic test that makes exhaustive searches practical for programs of useful size. The search space is defined by the processor's instruction set, which may include the whole set, but it is typically restricted to a subset. By constraining the instructions and observing the effect on the output program, one can gain insight into the design of instruction sets. In addition, superoptimized programs may be used by peephole optimizers to improve the quality of generated code, or by assembly language programmers to improve manually written code.
Article
Many versions of the fast Fourier transform require a reordering of either the input or the output data that corresponds to reversing the order of the bits in the array index. There has been a surprisingly large number of papers on this subject in the recent literature. This paper collects 30 methods for bit reversing an array. Each method was recoded into a uniform style in Fortran and its performance measured on several different machines, each with a different memory system. This paper includes a description of how the memories of the machines operate to motivate two new algorithms that perform substantially better than the others.
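The reordering described above can be made concrete with a naive baseline. The sketch below is not one of the 30 methods collected in the paper, nor one of its two new algorithms; it simply reverses the bits of each index and swaps pairs, assuming an array whose length is a power of two. The function names are hypothetical.

```python
def bit_reverse_index(i, bits):
    """Return the integer whose `bits`-bit binary form is that of i reversed."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # shift the lowest bit of i into r
        i >>= 1
    return r

def bit_reverse_permute(x):
    """Reorder list x in place so that element i moves to index bitrev(i)."""
    n = len(x)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of two")
    for i in range(n):
        j = bit_reverse_index(i, bits)
        if i < j:                # swap each pair only once
            x[i], x[j] = x[j], x[i]
    return x
```

Because bit reversal is its own inverse, applying `bit_reverse_permute` twice restores the original order. This baseline accesses memory with large, irregular strides, which is the kind of behaviour that makes the memory system matter in the comparison described above.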
Chapter
Publisher Summary: This chapter provides an overview of vectorizing FFTs. The fast Fourier transform (FFT) is the most well known of all algorithms. It is superior to the slow transform and has applications in all areas of scientific computing. The term FFT was originally applied to a specific algorithm for the rapid computation of the discrete complex Fourier transform; however, it has become a generic term that is applied to any one of a large number of algorithms that compute the complex as well as other Fourier transforms. Many algorithms exist for a given Fourier transform, and when they are applied to a particular sequence, the result is the same. However, the algorithms differ in the ways that intermediate results are computed and stored. It is these important differences that provide the algorithms with unique properties that make one or the other more attractive for a particular application.
Article
The Cooley-Tukey input and output maps are used to develop the general topology of an FFT algorithm for a given length. Within this algorithm the twiddle factors can be placed in a nearly uncountable number of arrangements. For a given machine architecture one can choose an optimality criterion for the algorithm. With this optimality criterion the set of possible algorithms can be searched for the best one. An exhaustive search is intractable for algorithms of reasonable length. Thus, the search space has been heuristically reduced to make the problem tractable. An example of a length-32 algorithm with the minimum number of multiplies is included.
Article
The adaptation of the Cooley-Tukey, the Pease and the Stockham FFTs to vector computers is discussed. Each of these algorithms computes the same result, namely the discrete Fourier transform. They differ only in the way that intermediate computations are stored. Yet it is this difference that makes one or the other more appropriate depending on the application. This difference also influences the computational efficiency on a vector computer and motivates the development of methods to improve efficiency. Each of the FFTs is defined rigorously by a short expository FORTRAN program which provides the basis for discussions about vectorization. Several methods for lengthening vectors are discussed, including the case of multiple and multi-dimensional transforms where M sequences of length N can be transformed as a single sequence of length MN using a 'truncated' FFT. The implementation of an in-place FFT on a computer with memory-to-memory architecture is made possible by in-place matrix-vector multiplication.
Article
The "Fast Fourier Transform" has now been widely known for about a year. During that time it has had a major effect on several areas of computing, the most striking example being techniques of numerical convolution, which have been completely revolutionized. What exactly is the "Fast Fourier Transform"?
Conference Paper
Cooley and Tukey have disclosed a procedure for synthesizing and analyzing Fourier series for discrete periodic complex functions. For functions of period N, where N is a power of 2, computation times are proportional to $N \log_2 N$ as expressed in Eq. (0).
Article
An efficient method for the calculation of the interactions of a $2^m$ factorial experiment was introduced by Yates and is widely known by his name. The generalization to $3^m$ was given by Box et al. (1). Good (2) generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series. In their full generality, Good's methods are applicable to certain problems in which one must multiply an N-vector by an N × N matrix which can be factored into m sparse matrices, where m is proportional to log N. This results in a procedure requiring a number of operations proportional to N log N rather than $N^2$. These methods are applied here to the calculation of complex Fourier series. They are useful in situations where the number of data points is, or can be chosen to be, a highly composite number. The algorithm is here derived and presented in a rather different form. Attention is given to the choice of N. It is also shown how special advantage can be obtained in the use of a binary computer with $N = 2^m$ and how the entire calculation can be performed within the array of N data storage locations used for the given Fourier coefficients. Consider the problem of calculating the complex Fourier series $X(j) = \sum_{k=0}^{N-1} A(k) \cdot W^{jk}, \quad j = 0, 1, \ldots, N-1.$ (1)
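To make the operation counts above concrete, the following sketch evaluates the sum X(j) = Σ A(k) W^(jk) in two ways: directly, with about N² operations, and by a recursive split into even- and odd-indexed coefficients for N a power of two, with about N log N operations. It is a textbook radix-2 recursion rather than the in-place formulation derived in the paper, and the choice W = exp(2πi/N) is an assumption, since the excerpt stops before W is defined. The function names are hypothetical.

```python
import cmath

def dft_direct(a):
    """Evaluate X(j) = sum_k A(k) * W**(j*k) term by term (about N**2 operations)."""
    n = len(a)
    w = cmath.exp(2j * cmath.pi / n)   # assumed: principal N-th root of unity
    return [sum(a[k] * w ** (j * k) for k in range(n)) for j in range(n)]

def fft_radix2(a):
    """Compute the same sums for N = 2**m by splitting into even and odd k."""
    n = len(a)
    if n == 1:
        return list(a)
    if n == 0 or n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(a[0::2])         # sums over k = 0, 2, 4, ...
    odd = fft_radix2(a[1::2])          # sums over k = 1, 3, 5, ...
    half = n // 2
    out = [0j] * n
    for j in range(half):
        t = cmath.exp(2j * cmath.pi * j / n) * odd[j]   # W**j times the odd part
        out[j] = even[j] + t
        out[j + half] = even[j] - t    # uses W**(j + N/2) = -W**j
    return out
```

Both functions should return the same values up to rounding; the recursive version performs m = log₂ N levels of N operations each, which is where the N log N count comes from.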
Conference Paper
Computers with multiple levels of caching have traditionally required techniques such as data blocking in order for algorithms to exploit the cache hierarchy effectively. These “cache-aware” algorithms must be properly tuned to achieve good performance using so-called “voodoo” parameters which depend on hardware properties, such as cache size and cache-line length. Surprisingly, however, for a variety of problems — including matrix multiplication, FFT, and sorting — asymptotically optimal “cache-oblivious” algorithms do exist that contain no voodoo parameters. They perform an optimal amount of work and move data optimally among multiple levels of cache. Since they need not be tuned, cache-oblivious algorithms are more portable than traditional cache-aware algorithms. We employ an “ideal-cache” model to analyze these algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal across a multilevel cache hierarchy. We also show that the assumption of optimal replacement made by the ideal-cache model can be simulated efficiently by LRU replacement. We also provide some empirical results on the effectiveness of cache-oblivious algorithms in practice.
Conference Paper
Most Cooley-Tukey, in-place Fast Fourier Transform (FFT) algorithms result in the output being permuted or scrambled in order. For a radix-2 FFT, this order can be easily found by reversing the order of the bits of the address, and the unscrambler is called a bit-reversed counter. In some machines, this unscrambling takes from 10% to 50% of the total execution time. This paper presents an in-place, radix-2 FFT that does the unscrambling while the FFT is being calculated rather than as a separate process. The theoretical framework is based on index maps [1] and ideas used on the in-place, in-order prime factor FFT (PFA) [2]. The non-scrambled algorithm is implemented in FORTRAN. The size of the program is essentially the same as the regular radix-2 FFT with its bit-reversed counter.
Article
New agents directed against molecular targets have now been tested in large prospectively randomized clinical trials for patients with untreated advanced non-small cell lung cancer (NSCLC). The trials have included small molecule inhibitors of receptor tyrosine kinases (gefitinib and erlotinib), monoclonal antibodies directed against either the ligands (bevacizumab against vascular endothelial growth factor) or their receptors (trastuzumab against ErbB2) and antisense nucleotides directed against mRNA coding for molecular targets (ISIS 3521 against protein kinase C alpha). The largest mature trials have included agents directed against ErbB1 (epidermal growth factor receptor). Two different small molecule inhibitors of ErbB1 (gefitinib and erlotinib) have been tested in 4 large international randomized trials. Two trials treated patients with advanced NSCLC with either gemcitabine and cisplatin or the same chemotherapy drugs combined with two different doses of gefitinib or one dose of erlotinib (INTACT 1 and TALENT). The other two trials treated patients with advanced NSCLC with either paclitaxel and carboplatin or the same chemotherapy drugs combined with either one of two doses of gefitinib or a single dose of erlotinib (INTACT 2 or TRIBUTE respectively). The results of INTACT 1 and INTACT 2 showed no significant difference in survival between those treated with conventional combination chemotherapy and those treated with the same chemotherapy plus gefitinib. The results of the TRIBUTE and TALENT studies showed their primary endpoints of improving overall survival were not met. The monoclonal antibody, bevacizumab, is being tested in an ongoing clinical trial. The trial will compare the outcome of patients with untreated advanced adenocarcinoma of the lung treated with paclitaxel plus carboplatin to those treated with the same chemotherapy drugs plus bevacizumab. The results are scheduled to be available in 2005–2006.
Although the initial trials adding newly developed molecular targeted agents have not shown initial success when added to conventional combination chemotherapy, further clinical trials are needed to ultimately define their role in the treatment of non-small cell lung cancer.
Article
The publication of the Cooley-Tukey fast Fourier transform (FFT) algorithm in 1965 opened a new area in digital signal processing by reducing the order of complexity of some crucial computational tasks, such as the Fourier transform and convolution, from $N^2$ to $N \log_2 N$, where N is the problem size. The development of the major algorithms (Cooley-Tukey and split-radix FFT, prime factor algorithm and Winograd fast Fourier transform) is reviewed. Then, an attempt is made to indicate the state of the art on the subject, showing the standing of research, open problems and implementations.
Article
It is shown that the self-sorting variants of the mixed-radix FFT algorithm may be specialized to the case of real or conjugate-symmetric input data. In comparison with conventional procedures, savings of around 20% are achieved in terms of operation counts. A multiple real/half-complex transform package on the Cray-1, based on the algorithms described here, achieves a 30% saving in CPU time compared with a package using conventional algorithms. A similar package has also been implemented on the Cyber 205.
Conference Paper
In this paper, we propose a blocking algorithm for computing large one-dimensional fast Fourier transforms (FFTs) on cache-based processors. Our proposed FFT algorithm is based on the six-step FFT algorithm. We show that the block six-step FFT algorithm improves performance by effectively utilizing the cache memory. Performance results of one-dimensional FFTs on the Sun Ultra 10 and Pentium III PC are reported. We succeeded in obtaining performance of about 108 MFLOPS on the Sun Ultra 10 (UltraSPARC-IIi 333 MHz) and about 247 MFLOPS on the 1 GHz Pentium III PC for a $2^{20}$-point FFT.
Conference Paper
Modern microprocessors can achieve high performance on linear algebra kernels but this currently requires extensive machine-specific hand tuning. We have developed a methodology whereby near-peak performance on a wide range of systems can be achieved automatically for such routines. First, by analyzing current machines and C compilers, we've developed guidelines for writing Portable, High-Performance, ANSI C (PHiPAC, pronounced 'fee-pack'). Second, rather than code by hand, we produce parameterized code generators. Third, we write search scripts that find the best parameters for a given system. We report on a BLAS GEMM compatible multi-level cache-blocked matrix multiply generator which produces code that achieves around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and over 80% of peak on the SGI Indigo R4k. The resulting routines are competitive with vendor-optimized BLAS GEMMs.
Article
A new procedure is presented for calculating the complex, discrete Fourier transform of real-valued time series. This procedure is described for an example where the number of points in the series is an integral power of two. This algorithm preserves the order and symmetry of the Cooley-Tukey fast Fourier transform algorithm while effecting the two-to-one reduction in computation and storage which can be achieved when the series is real. Also discussed are hardware and software implementations of the algorithm which perform only (N/4) log2 (N/2) complex multiply and add operations, and which require only N real storage locations in analyzing each N-point record.
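The two-to-one saving for real input can be illustrated with a well-known packing trick, which is a different technique from the procedure described in the abstract above: the even and odd samples of a real series of even length N are packed into one complex series of length N/2, a single half-length complex transform is computed, and the result is unpacked using conjugate symmetry. The sketch assumes the forward convention exp(-2πi kn/N) and uses a direct O(N²) `dft` helper only to stay self-contained; in practice the half-length transform would itself be an FFT. All names are hypothetical.

```python
import cmath

def dft(z):
    """Direct complex DFT with the exp(-2*pi*i*k*t/n) convention."""
    n = len(z)
    return [sum(z[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_dft_half(x):
    """Return X[0..N/2] of the DFT of real x using one length-N/2 complex DFT."""
    n = len(x)
    if n == 0 or n % 2:
        raise ValueError("length must be even and non-zero")
    m = n // 2
    # Pack even samples into the real part and odd samples into the imaginary part.
    z = [complex(x[2 * t], x[2 * t + 1]) for t in range(m)]
    Z = dft(z)
    out = []
    for k in range(m):
        zc = Z[(m - k) % m].conjugate()
        e = (Z[k] + zc) / 2      # transform of the even samples
        o = (Z[k] - zc) / 2j     # transform of the odd samples
        out.append(e + cmath.exp(-2j * cmath.pi * k / n) * o)
    # Nyquist term X[N/2] = E[0] - O[0], which is real for real input.
    out.append(complex(Z[0].real - Z[0].imag))
    return out
```

Only N/2 + 1 outputs are returned because the remaining ones follow from X[N-k] = conj(X[k]) for real input, which is also why N real storage locations suffice.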
Article
A framework for synthesizing communication-efficient distributed-memory parallel programs for block recursive algorithms such as the fast Fourier transform (FFT) and Strassen's matrix multiplication is presented. This framework is based on an algebraic representation of the algorithms, which involves the tensor (Kronecker) product and other matrix operations. This representation is useful in analyzing the communication implications of computation partitioning and data distributions. The programs are synthesized under two different target program models. These two models are based on different ways of managing the distribution of data for optimizing communication. The first model uses point-to-point interprocessor communication primitives, whereas the second model uses data redistribution primitives involving collective all-to-many communication. These two program models are shown to be suitable for different ranges of problem size. The methodology is illustrated by synthesizing communication-efficient programs for the FFT. This framework has been incorporated into the EXTENT system for automatic generation of parallel/vector programs for block recursive algorithms.
Article
A single signal processing algorithm can be represented by many mathematically equivalent formulas. However, when these formulas are implemented in code and run on real machines, they have very different runtimes. Unfortunately, it is extremely difficult to model this broad performance range. Further, the space of formulas for real signal transforms is so large that it is impossible to search it exhaustively for fast implementations. We approach this search question as a control learning problem. We present a new method for learning to generate fast formulas, allowing us to intelligently search through only the most promising formulas. Our approach incorporates signal processing knowledge, hardware features, and formula performance data to learn to construct fast formulas. Our method learns from performance data for a few formulas of one size and then can construct formulas that will have the fastest runtimes possible across many sizes.
Article
There are many published algorithms for transposing a rectangular matrix in place; however, none of these is suited to a vector processor. In this work we describe four methods and compare their speed and memory requirements to copying the matrix into workspace and then back again in transposed order.
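For readers unfamiliar with the problem, the sketch below shows the classical cycle-following approach to transposing a rectangular matrix stored row by row in a flat array. It is not one of the four methods of the paper above, and its irregular memory access is the kind of pattern that vectorizes poorly. It relies on the fact that the element at flat index i of a rows × cols matrix belongs at index (i · rows) mod (rows · cols − 1) after transposition, with the first and last elements fixed. The `visited` list is an assumption made for simplicity and costs one flag per element, so this version is not strictly in place.

```python
def transpose_in_place(a, rows, cols):
    """Transpose a row-major rows x cols matrix stored in flat list a.
    Afterwards a holds the cols x rows transpose, also row-major."""
    size = rows * cols
    if len(a) != size:
        raise ValueError("list length must equal rows * cols")
    if size < 2:
        return a
    visited = [False] * size          # simplification: one flag per element
    for start in range(1, size - 1):  # indices 0 and size-1 never move
        if visited[start]:
            continue
        i = start
        carried = a[start]
        while True:
            dest = (i * rows) % (size - 1)      # where the carried element belongs
            a[dest], carried = carried, a[dest]  # drop it there, pick up the occupant
            visited[dest] = True
            i = dest
            if i == start:
                break
    return a
```

As a usage example, `transpose_in_place([1, 2, 3, 4, 5, 6], 2, 3)` rearranges the 2 × 3 matrix with rows (1, 2, 3) and (4, 5, 6) into the 3 × 2 matrix with rows (1, 4), (2, 5) and (3, 6).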