Juliusz Sompolski’s scientific contributions


Publications (2)


Figure 3: The flow network (bipartite graph) model used to determine the responsibility assignment.  
Figure 4: DXchg Operator Implementation on high-speed InfiniBand networks (where MPI provides a native binding, avoiding slow TCP/IP emulation). DXchg producers send messages of a fixed size (≥ 256KB for good MPI throughput) to consumers and use double-buffering, such that communication overlaps with processing (i.e. partitioning + serialization). Tuples are serialized into MPI message buffers in a PAX-like layout, such that receivers can return vectors directly out of these buffers with minimal processing and no extra copying. It is often the case that the sets of producer and consumer nodes overlap, or are even the same. As an optimization, for intra-node communication we only send pointers to sender-side buffers in order to avoid the full memcpy() that MPI would otherwise need to do behind the scenes. For the two partitioning DXchg variants, DXchgHashSplit (DXHS) and DXchgRangeSplit (DXRS), the original implementations, described in [7], used a thread-to-thread approach: each DXchg sender partitions its own data with a fanout equal to the total number of receiver threads across all nodes. For senders to operate independently, they each require fanout private buffers. Typically, the nodes in the cluster are homogeneous (same number of cores), so the partitioning fanout equates to num_nodes * num_cores, and considering double-buffering and the fact that each node has num_cores DXchg threads active, this results in a per-node memory requirement of 2 * num_nodes * num_cores² buffers. As VectorH gets deployed on larger systems with typically tens and sometimes more than 100 nodes, each often having around 20 cores, partitioning into 100*20=2000 buffers becomes more expensive due to TLB misses. More urgently, this could lead to buffering many GBs (2 * 100 * 20² * 256KB ≈ 20GB). Not only is this excessive, but the fact that the buffers are only sent when full or when the input is exhausted (the latter being likely in the case of 20GB of buffer space per node, which also leads to small MPI message sizes) can make DXchg a materializing operator, reducing parallelism. Therefore, we implemented a thread-to-node approach, where rather than sending MPI messages to remote threads directly, these get sent to a node, reducing the fanout to num_nodes and the buffering requirement to 2 * num_nodes * num_cores buffers. The sender includes a one-byte column that identifies the receiving thread for each tuple. The exchange receiver lets consumer threads selectively consume data from incoming buffers using this one-byte column. The thread-to-node DXchg is more scalable than the default implementation, yet on low core counts and small clusters the thread […]
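
To make the buffering arithmetic in the caption concrete, here is a minimal Python sketch (function and variable names are my own, not from the paper) that computes the per-node buffer memory of the original thread-to-thread approach and of the thread-to-node variant for the 100-node, 20-core example above.

```python
# Back-of-the-envelope check of the per-node DXchg buffer memory quoted in the
# Figure 4 caption. The formulas come from the caption; the names are illustrative.

BUFFER_SIZE = 256 * 1024  # fixed MPI message/buffer size: 256 KB
DOUBLE_BUF = 2            # double-buffering: one buffer fills while the other is sent

def thread_to_thread_bytes(num_nodes: int, num_cores: int) -> int:
    # Each of the num_cores sender threads on a node keeps a private buffer per
    # receiver *thread* in the cluster, i.e. a fanout of num_nodes * num_cores.
    return DOUBLE_BUF * num_nodes * num_cores**2 * BUFFER_SIZE

def thread_to_node_bytes(num_nodes: int, num_cores: int) -> int:
    # Senders partition only per receiving *node*; a one-byte column per tuple tells
    # the receiver which consumer thread should pick it up.
    return DOUBLE_BUF * num_nodes * num_cores * BUFFER_SIZE

if __name__ == "__main__":
    nodes, cores = 100, 20
    print(f"thread-to-thread: {thread_to_thread_bytes(nodes, cores) / 1e9:.1f} GB")  # ~21 GB (the "20GB" in the caption)
    print(f"thread-to-node:   {thread_to_node_bytes(nodes, cores) / 1e9:.1f} GB")    # ~1 GB
```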
VectorH: Taking SQL-on-Hadoop to the Next Level
  • Conference Paper
  • Full-text available

June 2016 · 558 Reads · 16 Citations

Adrian Ionescu · [...] · Giel de Nijs

Actian Vector in Hadoop (VectorH for short) is a new SQL-on-Hadoop system built on top of the fast Vectorwise analytical database system. VectorH achieves fault tolerance and storage scalability by relying on HDFS, and extends the state-of-the-art in SQL-on-Hadoop systems by instrumenting the HDFS replication policy to optimize read locality. VectorH integrates with YARN for workload management, achieving a high degree of elasticity. Even though HDFS is an append-only filesystem, and VectorH supports (update-averse) ordered tables, trickle updates are possible thanks to Positional Delta Trees (PDTs), a differential update structure that can be queried efficiently. We describe the changes made to single-server Vectorwise to turn it into a Hadoop-based MPP system, encompassing workload management, parallel query optimization and execution, HDFS storage, transaction processing and Spark integration. We evaluate VectorH against HAWQ, Impala, SparkSQL and Hive, showing orders of magnitude better performance.
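
The abstract credits Positional Delta Trees (PDTs) with making trickle updates possible on top of an append-only filesystem. Purely as intuition for what a positional differential structure does, the sketch below merges an immutable base column with position-keyed inserts and deletes at scan time; the data layout and names are illustrative assumptions, not the actual PDT design or VectorH code.

```python
# Hedged illustration of scanning an immutable base column merged with positional
# deltas (inserts/deletes), in the spirit of Positional Delta Trees. NOT the
# VectorH/PDT implementation; it only conveys the merge-at-scan idea.

from dataclasses import dataclass, field

@dataclass
class PositionalDeltas:
    # base-column position -> rows to insert before that position
    inserts: dict[int, list] = field(default_factory=dict)
    # base-column positions whose rows are logically deleted
    deletes: set[int] = field(default_factory=set)

def scan_with_deltas(base: list, deltas: PositionalDeltas):
    """Yield the logical (updated) table without rewriting the immutable base."""
    for pos, value in enumerate(base):
        # emit any rows inserted before this base position
        yield from deltas.inserts.get(pos, [])
        if pos not in deltas.deletes:
            yield value
    # rows appended after the end of the base column
    yield from deltas.inserts.get(len(base), [])

if __name__ == "__main__":
    base = ["a", "b", "c", "d"]             # written once to append-only storage
    d = PositionalDeltas(inserts={2: ["b2"]}, deletes={3})
    print(list(scan_with_deltas(base, d)))  # ['a', 'b', 'b2', 'c']
```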


Figure 1: Project micro-benchmark: with and without {compilation, vectorization, SIMD}. The "SIMD-sht" lines work around the alignment suboptimality in icc SIMD code generation.
Vectorization vs. compilation in query execution

June 2011 · 731 Reads · 75 Citations

Compiling database queries into executable (sub-)programs provides substantial benefits compared to traditional interpreted execution. Many of these benefits, such as reduced interpretation overhead, better instruction code locality, and opportunities to use SIMD instructions, have previously been provided by redesigning query processors to use a vectorized execution model. In this paper, we try to shed light on how state-of-the-art compilation strategies relate to vectorized execution for analytical database workloads on modern CPUs. For this purpose, we carefully investigate the behavior of vectorized and compiled strategies inside the Ingres VectorWise database system in three use cases: Project, Select and Hash Join. One of the findings is that compilation should always be combined with block-wise query execution. Another contribution is identifying three cases where "loop-compilation" strategies are inferior to vectorized execution. As such, a careful merging of these two strategies is proposed for optimal performance: either by incorporating vectorized execution principles into compiled query plans or by using query compilation to create building blocks for vectorized processing.
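
As a toy illustration of the dichotomy the paper studies, the Python/NumPy sketch below contrasts tuple-at-a-time interpretation with block-wise (vectorized) evaluation of a simple projection. It is only a didactic analogy: the paper's experiments use VectorWise primitives and compiled code, not Python, and the numbers and names here are my own.

```python
# Toy contrast between tuple-at-a-time interpretation and block-wise (vectorized)
# execution of the projection  result = a * b + c.  Purely illustrative.

import numpy as np

def project_tuple_at_a_time(a, b, c):
    # Interpretation overhead is paid once per tuple: one "operator call" per row.
    out = []
    for i in range(len(a)):
        out.append(a[i] * b[i] + c[i])
    return out

def project_vectorized(a, b, c, vector_size=1024):
    # Interpretation overhead is paid once per *vector*; the inner work runs as
    # tight loops over contiguous blocks (and can use SIMD in a real engine).
    out = np.empty(len(a), dtype=a.dtype)
    for start in range(0, len(a), vector_size):
        end = start + vector_size
        np.add(np.multiply(a[start:end], b[start:end]), c[start:end],
               out=out[start:end])
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b, c = (rng.integers(0, 100, 1_000_000) for _ in range(3))
    assert list(project_vectorized(a, b, c)[:5]) == project_tuple_at_a_time(a[:5], b[:5], c[:5])
```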

Citations (2)


... Our experiments were performed on a system equipped with 4 Intel Xeon 8376H CPUs (56 threads each, across 4 NUMA nodes), 768 GB DDR4 RAM (24 DIMMs), 1537 GB Intel Optane Persistent Memory 200 Series (24 NV-DIMMs of 128 GB), and 6 Intel NVMe SSDs (2.5 TB each), running on Linux 5.4 and Ubuntu 20.04. We integrated the distributed join algorithms into the commercial Actian Vector DBMS [10], built on the X100 [7] analytical database engine. For MPI, we use Intel MPI 2019. ...

Cited in: So Far and yet so Near - Accelerating Distributed Joins with CXL
Reference: VectorH: Taking SQL-on-Hadoop to the Next Level

... Recent work has explored its use for ML [28] and its translation to SQL [6,50], though this sort of "pure" relational implementation cannot compete with a tensor-relational implementation in terms of performance. There is a high computational overhead to pushing each tuple through a relational system [30,34,49]. The carefully designed memory access patterns in an array-based kernel mean that a CPU or GPU operates at close to its optimal floating-point operations per second, with little overhead. ...

Reference: Vectorization vs. compilation in query execution