... According to Amdahl's Law [10], the optimal DoP for any resolution remains constant at one if we aim to minimize the cumulative resource occupancy time at the single-request level. However, this configuration of parallelism severely degrades the performance of high-resolution requests and blocks subsequent ones. ...
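For context, the trade-off described above follows directly from Amdahl's law. A short derivation (a sketch in our own notation, not the cited paper's: p is the parallelizable fraction, N the degree of parallelism, T_1 the single-device runtime):

    S(N) = \frac{1}{(1 - p) + p/N}, \qquad
    N \cdot T(N) = N \, T_1\left((1 - p) + \frac{p}{N}\right) = T_1\left((1 - p)\,N + p\right).

The cumulative resource occupancy N·T(N) therefore grows with N and is minimized at N = 1, even though the latency T(N) keeps shrinking, which is exactly the tension the authors point out.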
The Text-to-Video (T2V) model aims to generate dynamic and expressive videos from textual prompts. The generation pipeline typically involves multiple modules, such as a language encoder, a Diffusion Transformer (DiT), and a Variational Autoencoder (VAE). Existing serving systems often rely on monolithic model deployment and overlook the distinct characteristics of each module, leading to inefficient GPU utilization. In addition, DiT exhibits varying performance gains across different resolutions and degrees of parallelism, and significant optimization potential remains unexplored. To address these problems, we present DDiT, a flexible system that integrates both inter-phase and intra-phase optimizations. DDiT focuses on two key metrics: the optimal degree of parallelism, which prevents excessive parallelism for specific resolutions, and the starvation time, which quantifies the sacrifice of each request. To this end, DDiT introduces a decoupled control mechanism to minimize the computational inefficiency caused by imbalances in the degree of parallelism between the DiT and VAE phases. It also designs a greedy resource allocation algorithm with a novel scheduling mechanism that operates at single-step granularity, enabling dynamic and timely resource scaling. Our evaluation on the T5 encoder, OpenSora SDDiT, and OpenSora VAE models across diverse datasets reveals that DDiT significantly outperforms state-of-the-art baselines by up to 1.44x in p99 latency and 1.43x in average latency.
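A hedged sketch of the kind of step-granular, starvation-aware greedy allocation the abstract describes (the request fields, the optimal-DoP table, and the priority rule below are illustrative assumptions, not DDiT's actual policy):

    import time

    # Illustrative optimal degree of parallelism (DoP) per output resolution.
    OPTIMAL_DOP = {"240p": 1, "480p": 2, "720p": 4, "1080p": 8}

    def schedule_step(pending, free_gpus, now=None):
        """Greedy per-step allocation: serve the longest-starved requests first,
        giving each request at most its optimal DoP so parallelism is not wasted."""
        if now is None:
            now = time.time()
        plan = []
        # Starvation time = how long a request has been waiting since arrival.
        for req in sorted(pending, key=lambda r: now - r["arrival"], reverse=True):
            want = OPTIMAL_DOP.get(req["resolution"], 1)
            give = min(want, free_gpus)
            if give == 0:
                break
            plan.append((req["id"], give))
            free_gpus -= give
        return plan

    # Example: two waiting requests competing for 4 GPUs.
    reqs = [{"id": 1, "resolution": "720p", "arrival": 0.0},
            {"id": 2, "resolution": "240p", "arrival": 5.0}]
    print(schedule_step(reqs, free_gpus=4, now=10.0))  # -> [(1, 4)]

Because such a decision can be re-evaluated at every diffusion step, resources freed by a finishing request can be reassigned immediately rather than only at request boundaries.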
... John L. Gustafson, one of the fathers of massive parallelism, notes that powerful systems grant access to larger problems, and in turn increasingly large problems require powerful, parallel systems to be solved [2]: "One does not take a fixed-size problem and run it on various numbers of processors except when doing academic research; in practice, the problem size scales with the number of processors. When given a more powerful processor, the problem generally expands to make use of the increased facilities." ...
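This observation is the basis of the scaled speedup derived in this paper: with s and p = 1 - s the serial and parallel time fractions measured on the N-processor system,

    S_{\text{scaled}}(N) = \frac{s + p\,N}{s + p} = s + p\,N = N + (1 - N)\,s,

which grows almost linearly in N when the problem size is allowed to scale with the machine, in contrast to the fixed-size Amdahl bound.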
This report describes various parallelization techniques and results obtained while tackling the acceleration of the conjugate gradient algorithm on heterogeneous systems. CPU-, GPU-, and FPGA-based approaches are explored for increasing matrix sizes and benchmarked on the MeluXina supercomputer and on an NVIDIA GH200 Grace Hopper Superchip cluster provided by Università di Ferrara. All implementations are effective: while CPUs prove optimal for smaller problems, GPUs shine as dimensions increase by providing high FLOPS while maintaining high precision. The FPGA nodes provided by MeluXina are also explored and constitute a viable and energy-efficient alternative to the other architectures. CG workloads are quite heavy from an energy-consumption standpoint, but MeluXina's cooling systems and hardware allow for a high FLOPS-to-power ratio (26.957 GFLOPS/W).
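As a point of reference for the algorithm being accelerated, here is a minimal sequential conjugate gradient in Python/NumPy (a sketch only; it is not the report's CPU, GPU, or FPGA implementation, and the tolerance and iteration cap are illustrative):

    import numpy as np

    def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
        """Solve A x = b for a symmetric positive-definite matrix A."""
        x = np.zeros_like(b, dtype=float)
        r = b - A @ x              # residual
        p = r.copy()               # search direction
        rs_old = r @ r
        for _ in range(max_iter):
            Ap = A @ p
            alpha = rs_old / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    print(conjugate_gradient(A, b))    # approx. [0.0909, 0.6364]

The dominant cost per iteration is the matrix-vector product, which is what CPU, GPU, and FPGA back-ends parallelize in different ways.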
... Some general remarks: in the literature, there exist further parallel performance measures and models. For instance, there is Gustafson's law [25], which states that an arbitrarily large problem can be efficiently parallelized. Another elaborate model is the work-span model [13, 9, 43]. ...
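For reference, the work-span model mentioned above bounds the runtime T_P on P processors in terms of the total work T_1 and the span (critical-path length) T_\infty; with a greedy scheduler (Brent's bound),

    \max\left(\frac{T_1}{P},\, T_\infty\right) \;\le\; T_P \;\le\; \frac{T_1}{P} + T_\infty,
    \qquad \text{hence} \quad \text{speedup} \le \min\left(P,\ \frac{T_1}{T_\infty}\right).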
Unlike in the past, today many modern database systems hold their working set, or even the whole database, completely in main memory. As a result, the performance bottleneck has shifted from disk access to main memory access and even computation. Since CPU clock rates have leveled off, one cannot hope for an existing sequential algorithm to do its computations faster over time. From a computer architecture point of view, this issue is tackled by providing hardware parallelism via multiple CPU cores. From a software development point of view, parallel algorithms and concurrent data structures must be provided to efficiently utilize the parallel hardware environment.
In this thesis, we present a parallel physical algebra, i.e., a set of algorithms that implement the operators of the relational algebra. The algebra uses a state-of-the-art push-based approach and passes single tuples through operator pipelines. The drivers of parallelism in this algebra are (1) a parallel scan operator, which initiates parallel processing of tuples, and (2) concurrent hash tables, which allow for storing and extracting multiple entries simultaneously. To limit the scope of the thesis, we restrict ourselves to hash-based approaches. The algebra is backed by an efficient infrastructure consisting of a thread pool and a concurrent queue, which allow for dynamically distributing jobs over threads and optionally binding threads to NUMA nodes or CPU cores.
We evaluate our algebra through performance tests in which we aim to minimize the run-time and maximize the speed-up. In order to make general statements, we investigate different components in isolation under various workloads. Thereby we try to (1) find the optimal parameter space, (2) showcase the actual behavior, and (3) approximate the behavior by functions.
Finally, we draw a conclusion about our approach and the performance of our implementation of a parallel physical algebra.
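To make the two drivers of parallelism concrete, here is a minimal sketch of a push-based parallel scan feeding a concurrent (lock-striped) hash table in Python; the class and function names are illustrative and not taken from the thesis:

    import threading
    from concurrent.futures import ThreadPoolExecutor

    class StripedHashTable:
        """A simple concurrent hash table using lock striping, standing in for
        the concurrent hash tables that back join and aggregation operators."""
        def __init__(self, num_stripes=64):
            self.buckets = [dict() for _ in range(num_stripes)]
            self.locks = [threading.Lock() for _ in range(num_stripes)]

        def upsert(self, key, value, combine):
            i = hash(key) % len(self.buckets)
            with self.locks[i]:
                bucket = self.buckets[i]
                bucket[key] = combine(bucket[key], value) if key in bucket else value

        def items(self):
            for bucket in self.buckets:
                yield from bucket.items()

    def parallel_scan_aggregate(tuples, key_fn, val_fn, num_threads=4):
        """Push-based parallel scan: each worker scans one chunk of the input
        and pushes its tuples into the shared hash table (hash aggregation)."""
        table = StripedHashTable()
        chunk = (len(tuples) + num_threads - 1) // num_threads

        def scan_worker(lo):
            for t in tuples[lo:lo + chunk]:
                table.upsert(key_fn(t), val_fn(t), lambda a, b: a + b)

        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            list(pool.map(scan_worker, range(0, len(tuples), chunk)))
        return dict(table.items())

    # Example: SUM(price) GROUP BY category over a tiny relation.
    rows = [("a", 10), ("b", 5), ("a", 7), ("c", 1)]
    print(parallel_scan_aggregate(rows, key_fn=lambda r: r[0], val_fn=lambda r: r[1]))

In a full engine, the thread pool would additionally pin workers to NUMA nodes or CPU cores, as described above.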
Fast and scalable data transfer is crucial in today's decentralized data ecosystems and data-driven applications. Example use cases include transferring data from operational systems to consolidated data warehouse environments, or from relational database systems to data lakes for exploratory data analysis or ML model training. Traditional data transfer approaches rely on efficient point-to-point connectors or on general middleware with generic intermediate data representations. Physical environments (e.g., on-premise, cloud, or consumer nodes) have also become increasingly heterogeneous. Existing work still struggles to achieve both fast and scalable data transfer and generality with respect to heterogeneous systems and environments. Hence, in this paper, we introduce a holistic data transfer framework. Our XDBC framework splits the data transfer pipeline into logical components and provides a wide variety of physical implementations for these components. This design allows seamless integration of different systems as well as automatic optimization of data transfer configurations according to workload and environment characteristics. Our evaluation shows that XDBC outperforms state-of-the-art generic data transfer tools by up to 5x, while being on par with specialized approaches.
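To illustrate the idea of a transfer pipeline split into interchangeable logical components, here is a hedged sketch of a staged reader/transformer/writer pipeline; the component names and the queue-based wiring are our own illustration, not XDBC's actual interfaces:

    import json
    import queue
    import threading

    def run_pipeline(read, transform, write, capacity=1024):
        """Wire three pipeline components (read -> transform -> write) with
        bounded queues so each stage can be swapped or scaled independently."""
        q1, q2 = queue.Queue(capacity), queue.Queue(capacity)
        DONE = object()

        def reader():
            for batch in read():
                q1.put(batch)
            q1.put(DONE)

        def stage(fn, src, dst):
            while True:
                item = src.get()
                if item is DONE:
                    if dst is not None:
                        dst.put(DONE)
                    return
                out = fn(item)
                if dst is not None:
                    dst.put(out)

        threads = [threading.Thread(target=reader),
                   threading.Thread(target=stage, args=(transform, q1, q2)),
                   threading.Thread(target=stage, args=(write, q2, None))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    # Example: "read" rows, serialize each batch to a JSON line, "write" to stdout.
    rows = [[("id", 1)], [("id", 2)]]
    run_pipeline(read=lambda: iter(rows),
                 transform=lambda batch: json.dumps(dict(batch)),
                 write=print)

Swapping, say, a different deserializer or a network writer for print only touches one stage, which is the property that enables per-workload and per-environment optimization.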
A direct boundary-element method approach to solving three-dimensional boundary-value problems of poroelastic dynamics is considered. To increase the efficiency of the numerical modeling, parallel computing is employed. The results of computer experiments are presented.
It is challenging to store and process massive unstructured data for key-value storage systems that require high concurrency, high performance, and low latency. Log-Structured Merge (LSM) tree-based key-value stores (KV stores) are widely adopted for their enhanced write performance. Existing KV stores are primarily deployed on a CPU-centric architecture, which necessitates moving data from memory or storage devices to the CPU for processing. This is particularly problematic during the compaction process, which involves substantial data movement and rewriting. This approach consumes bandwidth and computational resources, leading to write amplification and impairing system performance. Near-Data Processing (NDP) devices mitigate this issue by processing data at the storage location, thereby reducing data movement costs. Recognizing that the computational power of a single NDP device is insufficient for the demands of large-scale unstructured data processing, we propose JMStore, a multi-NDP key-value store based on a hash data organization. By offloading computational tasks to multiple NDP devices, JMStore collaboratively optimizes the compaction process to address the data movement issue as well as the mismatch between the large-scale data in workloads and the computational power of a single NDP device. We design a key-value store programming model and data organization for a multi-NDP architecture, enabling the system to leverage the hardware efficiency and parallelism of multiple NDP devices to significantly improve system performance. We also propose a strategy to balance storage and computation resources under a hash layout. Compared to the latest single-NDP KV store (PStore) and the multi-NDP KV store (MStore), JMStore demonstrates significant performance improvements under DB_Bench and YCSB-C with read-write mixed workloads: the peak improvement reaches a factor of 10.
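A hedged sketch of what hash-organized, multi-device compaction offloading can look like (the device class, bucket layout, and merge rule are illustrative assumptions, not JMStore's implementation):

    from concurrent.futures import ThreadPoolExecutor

    class FakeNDPDevice:
        """Stand-in for an NDP device: it "compacts" a bucket by merging its runs
        and keeping the newest value per key, i.e., the work that would otherwise
        be moved to the CPU."""
        def compact(self, runs):
            merged = {}
            for run in runs:           # older runs first, newer values overwrite
                merged.update(run)
            return dict(sorted(merged.items()))

    def bucket_of(key, num_buckets):
        """Hash data organization: a key's hash decides which bucket/device owns it."""
        return hash(key) % num_buckets

    def offload_compaction(buckets, devices):
        """Dispatch each bucket's compaction to its owning device in parallel."""
        with ThreadPoolExecutor(max_workers=len(devices)) as pool:
            futures = [pool.submit(devices[i].compact, runs)
                       for i, runs in enumerate(buckets)]
            return [f.result() for f in futures]

    # Example: two devices, each holding two small runs to be merged in place.
    devices = [FakeNDPDevice(), FakeNDPDevice()]
    buckets = [[{"a": 1, "c": 3}, {"a": 9}], [{"b": 2}, {"d": 4, "b": 7}]]
    print(offload_compaction(buckets, devices))   # [{'a': 9, 'c': 3}, {'b': 7, 'd': 4}]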
This research introduces an innovative probabilistic method for examining torsional stress behavior in spherical shell structures through Monte Carlo simulation techniques. The spherical geometry of these components creates distinctive computational difficulties for conventional analytical and deterministic numerical approaches when solving torsion-related problems. The authors develop a comprehensive mesh-free Monte Carlo framework built upon the Feynman–Kac formula, which maintains the geometric symmetry of the domain while offering a probabilistic solution representation via stochastic processes on spherical surfaces. The technique models Brownian motion paths on spherical surfaces using the Euler–Maruyama numerical scheme, converting the Saint-Venant torsion equation into a problem of stochastic integration. The computational implementation utilizes the Fibonacci sphere technique for achieving uniform point placement, employs adaptive time-stepping strategies to address pole singularities, and incorporates efficient algorithms for boundary identification. This symmetry-maintaining approach circumvents the mesh generation complications inherent in finite element and finite difference techniques, which typically compromise the problem’s natural symmetry, while delivering comparable precision. Performance evaluations reveal nearly linear parallel computational scaling across up to eight processing cores with efficiency rates above 70%, making the method well-suited for multi-core computational platforms. The approach demonstrates particular effectiveness in analyzing torsional stress patterns in thin-walled spherical components under both symmetric and asymmetric boundary scenarios, where traditional grid-based methods encounter discretization and convergence difficulties. The findings offer valuable practical recommendations for material specification and structural design enhancement, especially relevant for pressure vessel and dome structure applications experiencing torsional loads. However, the probabilistic characteristics of the method create statistical uncertainty that requires cautious result interpretation, and computational expenses may surpass those of deterministic approaches for less complex geometries. Engineering analysis of the outcomes provides actionable recommendations for optimizing material utilization and maintaining structural reliability under torsional loading conditions.
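Two of the building blocks named above are easy to sketch; the following Python/NumPy fragment shows a Fibonacci-sphere point set and a single projected Euler-Maruyama step of Brownian motion on the unit sphere (a simplified sketch, not the authors' adaptive-time-stepping implementation):

    import numpy as np

    def fibonacci_sphere(n):
        """Nearly uniform points on the unit sphere via the Fibonacci lattice."""
        i = np.arange(n)
        golden = (1 + 5 ** 0.5) / 2
        z = 1 - (2 * i + 1) / n              # uniformly spaced heights
        theta = 2 * np.pi * i / golden       # golden-ratio rotation in azimuth
        r = np.sqrt(1 - z ** 2)
        return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)

    def brownian_step_on_sphere(x, dt, rng):
        """One Euler-Maruyama step of Brownian motion constrained to the sphere:
        draw an isotropic Gaussian increment, project it onto the tangent plane
        at x, and renormalize the result back onto the unit sphere."""
        dw = rng.normal(scale=np.sqrt(dt), size=3)
        dw -= np.dot(dw, x) * x              # tangential projection
        y = x + dw
        return y / np.linalg.norm(y)

    # Example: 500 starting points and one diffusion step from the first point.
    points = fibonacci_sphere(500)
    rng = np.random.default_rng(0)
    print(points.shape, brownian_step_on_sphere(points[0], dt=1e-3, rng=rng))

The Feynman-Kac estimate of the torsion function is then obtained by averaging a functional of many such paths started at each evaluation point, which is what parallelizes almost perfectly across cores.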
We summarize the requirements for sustainable development, basic resources, the finiteness of human knowledge, and the needs leading to system requirements. We present a general hierarchy of systems, including natural and technical systems. Self-organizing systems are the highest in the hierarchy of technical systems and can thus, in special cases, provide all other systems lower in the hierarchy. We do not have a single systems theory; rather, many theories complement each other. A modern form of the general theory of systems is complexity theory, a spin-off of cybernetics and artificial intelligence that is related to system dynamics. The system archetypes include, for example, feedback, optimization and decision-making, hierarchy, and degree of centralization. The most advanced systems are nonlinear and dynamic, such as nonequilibrium systems representing various whirlpools in nature.
For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution. Variously the proper direction has been pointed out as general purpose computers with a generalized interconnection of memories, or as specialized computers with geometrically related memory interconnections and controlled by one or more instruction streams.
CR Categories and Subject Descriptors: C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors) - parallel processors
General Terms: Theory
Additional Key Words and Phrases: Amdahl's law, massively parallel processing, speedup
Benner, R. E., Gustafson, J. L., and Montry, G. R. Development and analysis of scientific application programs on a 1024-processor hypercube. SAND 88-0317, Sandia National Laboratories, Albuquerque, N.M., Feb. 1988.