Figure 1 - uploaded by Mrunal Gawade
Schematic diagram for Intel Xeon E5-4657L v2 @ 2.40 GHz CPU

In four and eight socket CPUs, memory access is considerably more expensive if the memory being accessed is remote. Figure 1 shows a shared memory four socket CPU system where each CPU socket is associated with its own memory module (DRAM), and can also access a remote DRAM through the Quick Path Interconnect (QPI) [3]. The memory access latency thus varies considerably based on whether the memory being accessed is local or remote. For example, a process residing on socket 0 accesses its local memory much faster than the remote memory on socket 2, as socket 2 is two hops away from socket 0. Non-uniform memory access (NUMA) [11] is thus a result of the different memory access latencies across sockets in a shared memory system.

The graph in Figure 2 shows such an example for TPC-H Q1 (scale factor 100 GB on a four socket CPU). We plot the average of 6 runs (minimal variation is observed between consecutive runs), clearing the buffer cache between independent query executions. The database server process (using memory mapped storage) is allowed to execute strictly on two sockets (0 and 1) only, by pinning the process's affinity to both sockets using the tool numactl [5]. The memory allocation for the process, on the other hand, is allowed to take place on different sockets (0 to 3), using numactl's memory binding option, to emphasize that the locality of data and the memory access distance matter and affect the execution time. When the memory allocation is local (sockets 0 and 1), the execution time is lowest, as there is minimal cross socket data access. The execution time is highest when the memory allocation occurs only on socket 2, as memory on socket 2 is two hops away from socket 0 and one hop away from socket 1. The operating system does not allocate memory pages uniformly across sockets, hence the part of the process executing on socket 0 tends to access more pages compared to the part executing on socket 1. This also explains why the ...
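The binding described above can also be reproduced programmatically. The following minimal C sketch (an illustration only, assuming a Linux system with libnuma installed and four NUMA nodes numbered 0 to 3 as in Figure 1) restricts execution to nodes 0 and 1 while binding memory allocation to node 2, roughly what numactl --cpunodebind=0,1 --membind=2 does from the command line:

    #include <numa.h>      /* libnuma; compile with -lnuma */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }

        /* Run this process only on NUMA nodes 0 and 1
           (the rough equivalent of numactl --cpunodebind=0,1). */
        struct bitmask *cpu_nodes = numa_parse_nodestring("0,1");
        numa_run_on_node_mask(cpu_nodes);

        /* Force all further allocations onto node 2
           (the rough equivalent of numactl --membind=2). */
        struct bitmask *mem_nodes = numa_parse_nodestring("2");
        numa_set_membind(mem_nodes);

        /* Memory touched from here on is served from node 2, so every
           access from code running on nodes 0 and 1 is remote. */
        size_t sz = 64UL * 1024 * 1024;
        char *buf = malloc(sz);
        memset(buf, 0, sz);

        free(buf);
        numa_bitmask_free(cpu_nodes);
        numa_bitmask_free(mem_nodes);
        return 0;
    }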

Source publication
Conference Paper
Full-text available
With the rise of multi-socket multi-core CPUs, a lot of effort is being put into how to best exploit their abundant CPU power. In a shared memory setting the multi-socket CPUs are equipped with their own memory modules, and access memory modules across sockets in a non-uniform access pattern (NUMA). Memory access across sockets is relatively expensive...

Contexts in source publication

Context 1
... diagram for Intel Xeon E5-4657L v2 @2.40GHz CPU in four and eight socket CPUs it is considerably expensive, if the memory being accessed is remote. Figure 1 shows a shared memory four socket CPU system where each CPU socket is associated with its own memory module (DRAM), and can also access a remote DRAM through Quick Path Interconnect (QPI) [3]. The memory access latency thus varies considerably based on whether the memory being accessed is local or remote. ...
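The latency gap itself can be observed with a small experiment: pin a thread to one node and time a scan over a buffer allocated locally versus one allocated on a distant node. The C sketch below is only an illustration of the idea, not a rigorous benchmark; it assumes libnuma is available and, as in Figure 1, that node 0 is local and node 2 is two hops away, and it uses a naive sequential read loop, so caching and prefetching will understate the true gap:

    #include <numa.h>      /* compile with -lnuma */
    #include <stdio.h>
    #include <time.h>

    /* Time repeated sequential reads over buf (rough measurement only). */
    static double walk_seconds(volatile long *buf, size_t n, int reps) {
        struct timespec t0, t1;
        long sink = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                sink += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sink;
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void) {
        if (numa_available() < 0) return 1;
        numa_run_on_node(0);                        /* execute on node 0 */

        size_t n = 32UL * 1024 * 1024;              /* 256 MB of longs   */
        long *local  = numa_alloc_onnode(n * sizeof(long), 0); /* local  */
        long *remote = numa_alloc_onnode(n * sizeof(long), 2); /* 2 hops */
        if (!local || !remote) return 1;

        printf("local : %.3f s\n", walk_seconds(local,  n, 4));
        printf("remote: %.3f s\n", walk_seconds(remote, n, 4));

        numa_free(local,  n * sizeof(long));
        numa_free(remote, n * sizeof(long));
        return 0;
    }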
Context 2
... contrast, the promising query execution time for the queries 4, 6, and 19 by MonetDB's NUMA Distr approach prompted us to explore more queries. We plot their execution time in Figure 10. It shows that the query execution performance of MonetDB's NUMA Distr approach is comparable to Hyper's parallel execution performance for the query set under evaluation. ...
Context 3
... shows that the query execution performance of MonetDB's NUMA Distr approach is comparable to Hyper's parallel execution performance for the query set under evaluation. In Figure 10 MonetDB uses 96 threads in total. To match Hyper's hardware configuration we restricted MonetDB's execution to 64 threads. ...
Context 4
... match Hyper's hardware configuration we restricted MonetDB's execution to 64 threads. Even with this change, MonetDB's NUMA Distr numbers do not show much variation compared to the execution times in Figure 10. ...

Citations

... In a shared memory setting the multi-socket CPUs are equipped with their own memory module, and access memory modules across sockets in a NUMA pattern [2]. Memory access across sockets is relatively expensive compared to memory access within a socket [2]. ...
... In a shared memory setting the multi-socket CPUs are equipped with their own memory module, and access memory modules across sockets in a NUMA pattern [2]. Memory access across sockets is relatively expensive compared to memory access within a socket [2]. And although the interconnect interfaces between CPU sockets are much faster than they used to be, CPUs are also much faster, their built-in memory controllers are faster, and memory is faster. ...
... A recent prototype of SAP Hana implements an adaptive algorithm to decide between data placement and thread stealing when load imbalance is detected [5] (Li et al. [20] presented a similar tradeoff strategy). However, the number of threads changes based on core utilization, as opposed to MonetDB [24] and SQL Server [6], which bound the number of worker threads and the maximum number of data partitions to the number of cores available per socket. MonetDB and SQL Server implement similar vector structures internally (i.e., BAT) to boost parallel access to disjoint partitions of columns (discussed in Section II). ...
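As a rough illustration of that one-worker-per-core, disjoint-partition pattern (a hypothetical sketch, not the actual BAT internals of MonetDB or the SQL Server implementation), the C code below splits a column into per-core slices and aggregates each slice in its own thread:

    #include <pthread.h>   /* compile with -pthread */
    #include <stdio.h>
    #include <stdlib.h>

    #define NCORES 4       /* assume 4 cores per socket for the sketch */

    struct slice { const long *col; size_t lo, hi; long sum; };

    /* Each worker scans only its own disjoint slice of the column. */
    static void *worker(void *arg) {
        struct slice *s = arg;
        long acc = 0;
        for (size_t i = s->lo; i < s->hi; i++)
            acc += s->col[i];
        s->sum = acc;
        return NULL;
    }

    int main(void) {
        size_t n = 1UL << 20;                 /* toy column of 1M longs */
        long *col = malloc(n * sizeof(long));
        for (size_t i = 0; i < n; i++) col[i] = 1;

        pthread_t tid[NCORES];
        struct slice part[NCORES];
        size_t chunk = n / NCORES;

        for (int c = 0; c < NCORES; c++) {    /* one partition per core */
            size_t hi = (c == NCORES - 1) ? n : (size_t)(c + 1) * chunk;
            part[c] = (struct slice){ col, (size_t)c * chunk, hi, 0 };
            pthread_create(&tid[c], NULL, worker, &part[c]);
        }

        long total = 0;
        for (int c = 0; c < NCORES; c++) {
            pthread_join(tid[c], NULL);
            total += part[c].sum;
        }
        printf("sum = %ld\n", total);
        free(col);
        return 0;
    }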
Conference Paper
Full-text available
During the parallel execution of queries in Non-Uniform Memory Access (NUMA) systems, the Operating System (OS) maps the threads (or processes) from modern database systems to the available cores among the NUMA nodes using the standard node-local policy. However, such non-smart mapping may result in inefficient memory activity, because shared data may be accessed by scattered threads requiring large data movements or non-shared data may be allocated to threads sharing the same cache memory, increasing its conflicts. In this paper we present a data-distribution aware and elastic multi-core allocation mechanism to improve the OS mapping of database threads in NUMA systems. Our hypothesis is that we mitigate the data movement if we only hand out to the OS the local optimum number of cores in specific nodes. We propose a mechanism based on a rule-condition-action pipeline that uses hardware counters to promptly find out the local optimum number of cores. Our mechanism uses a priority queue to track the history of the memory address space used by database threads in order to decide about the allocation/release of cores and its distribution among the NUMA nodes to decrease remote memory access. We implemented and tested a prototype of our mechanism when executing two popular Volcano-style databases improving their NUMA-affinity. For MonetDB, we show maximum speedup of 1.53×, due to consistent reduction in the local/remote per-query data traffic ratio of up to 3.87× running 256 concurrent clients in the 1 GB TPC-H database also showing system energy savings of 26.05%. For the NUMA-aware SQL Server, we observed speedup of up to 1.27× and reduction on the data traffic ratio of 3.70×.
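The rule-condition-action pipeline described in this abstract can be pictured, very loosely, as a monitoring loop over a locality metric. The C sketch below is a hypothetical outline only; the read_*_accesses and core-allocation helpers are placeholders for real hardware-counter reads and for the mechanism's actual allocation logic, neither of which is detailed in the abstract:

    #include <stdio.h>

    /* Hypothetical stand-ins for hardware-counter reads (a real
       implementation would sample uncore/memory-traffic counters). */
    static long read_local_accesses(void)  { return 0; }
    static long read_remote_accesses(void) { return 0; }

    /* Hypothetical allocation hooks: hand cores back to, or take
       cores from, the OS on specific NUMA nodes. */
    static void release_core_on_remote_node(void)    { }
    static void grant_extra_core_on_local_node(void) { }

    /* One iteration of a rule-condition-action style loop:
       rule      = monitor the local/total access ratio,
       condition = the ratio drops below a threshold,
       action    = shrink the core set on remote nodes.          */
    static void adapt_once(double threshold) {
        long local  = read_local_accesses();
        long remote = read_remote_accesses();
        double ratio = (local + remote == 0)
                         ? 1.0
                         : (double)local / (double)(local + remote);

        if (ratio < threshold)
            release_core_on_remote_node();     /* too much remote traffic */
        else
            grant_extra_core_on_local_node();  /* locality is fine, grow  */
    }

    int main(void) {
        adapt_once(0.8);   /* e.g., require at least 80% local accesses */
        return 0;
    }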
... MCC-DB classifies queries in cache-sensitive and cache-insensitive to feed the query execution scheduler. Similar to our mechanism, in [5] NUMA cores are allocated one by one to mitigate access to memory banks in distant nodes when the OS tries to keep data locality. However, these approaches are intrusive, requiring modifications in the source code of the DBMS (e.g., MonetDB and PostgreSQL). ...
Conference Paper
Full-text available
In the parallel execution of queries in Non-Uniform Memory Access (NUMA) systems, the operating system maps database processes/threads (i.e., workers) to the available cores across the NUMA nodes. However, this mapping results in poor cache activity with many minor page faults and slower query response time when workers and data are allocated in different NUMA nodes. The system needs to move large volumes of data around the NUMA nodes to catch up with the running workers. Our hypothesis is that we mitigate the data movement to boost cache hits and response time if we only hand out to the system the local optimum number of cores instead of all the available ones. In this paper we present a PetriNet mechanism that represents the load of the database workers for dynamically computing and allocating the local optimum number of CPU cores to tackle such load. Preliminary results show that data movement diminishes with the local optimum number of CPU cores.
... Figure 4b shows that when the same experiment is repeated on the 4 socket NUMA machine on a 100GB dataset, the results are quite different. No explicit NUMA aware data partitioning is used as MonetDB uses memory mapped storage [13]. Execution with up to 48 threads uses the physical threads (12 threads on each socket with numactl [2] process and memory affinity), whereas 72 and 96 threaded execution also uses the hyper-threads. ...
Conference Paper
Full-text available
Columnar database systems, designed for optimal OLAP workload performance, strive for maximum multi-core utilization under concurrent query execution. However, a multi-core parallel plan generated for isolated execution leads to suboptimal performance during concurrent query execution. In this paper, we analyze the effects of concurrent workload resource contention on multi-core plans using three intra-query parallelization techniques: static, adaptive, and cost model parallelization. We focus on a plan level comparison of selected TPC-H queries, using in-memory multi-core columnar systems. Excessive partitions in statically parallelized plans result in heavy L3 cache misses leading to memory contention, degrading query performance severely. Overall, adaptive plans show more robustness, less scheduling overhead, and an average 50% execution time improvement compared to statically parallelized plans and cost model based plans.
... The execution times for both the two and four socket machines are similar, which indicates minimal NUMA effects. As the authors in [14] observe, since MonetDB uses a memory mapped representation for the buffer data, as the number of partitions increases, we expect them to get assigned to the memory modules of the sockets on which operator execution gets scheduled. (Figure 17: Isolated execution performance of TPC-DS queries on a) a 2 socket machine with a 2.00 GHz CPU and b) a 4 socket machine with a 2.40 GHz CPU, on 100GB data.) ...
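The placement behaviour alluded to here follows from Linux's default first-touch policy: a memory-mapped page tends to end up on the DRAM of the node whose thread first writes it. The C sketch below (an illustration, assuming libnuma and at least two NUMA nodes) first-touches an anonymous mapping from node 1 and then queries where the pages actually landed; numa_move_pages with a NULL node list only reports current placement, it moves nothing:

    #include <numa.h>       /* compile with -lnuma */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        if (numa_available() < 0) return 1;

        /* Anonymous mapping standing in for a memory-mapped column. */
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t len  = 16 * page;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;

        /* First touch from a thread running on node 1: under the
           default first-touch policy these pages should land there. */
        numa_run_on_node(1);
        memset(buf, 0, len);

        /* Query which node each page resides on (nodes == NULL). */
        void *pages[16];
        int   status[16];
        for (int i = 0; i < 16; i++) pages[i] = buf + i * page;
        numa_move_pages(0, 16, pages, NULL, status, 0);

        for (int i = 0; i < 16; i++)
            printf("page %2d -> node %d\n", i, status[i]);

        munmap(buf, len);
        return 0;
    }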
Chapter
An array database is software that uses non-linear data structures to store and process multidimensional data, including images and time series. As multi-dimensional data applications are generally data-intensive, array databases can benefit from multi-processing systems to improve performance. However, when dealing with Non-Uniform Memory Access (NUMA) machines, the movement of massive amounts of data across NUMA nodes may result in significant performance degradation. This paper presents a mechanism for scheduling array database threads based on data movement patterns and performance monitoring information. Our scheduling mechanism uses non-cooperative game theory to determine the optimal thread placement. Threads act as decision-makers selecting the best NUMA node based on each node's remote memory access cost. We implemented and tested our mechanism on two array databases (Savime and SciDB), demonstrating improved NUMA-affinity. With Savime, we observed a maximum speedup of 1.64× and a consistent reduction of up to 2.46× in remote data access during subarray operations. With SciDB, we observed a speedup of up to 1.38× and a reduction of 1.71× in remote data access.
Keywords: Array databases, Thread pinning, Nash equilibrium, Query processing
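The game-theoretic placement described in this chapter can be pictured as a best-response iteration: each thread repeatedly moves to the node that currently looks cheapest to it, until no thread wants to move (a Nash equilibrium of the resulting game). The C sketch below uses a made-up congestion cost model purely for illustration; it is not the mechanism implemented for Savime or SciDB:

    #include <stdio.h>

    #define NODES   4
    #define THREADS 8

    /* Hypothetical cost for thread t on node n: a base access cost plus
       a congestion penalty that grows with co-located threads. */
    static double node_cost(int n, const int placement[], int t) {
        int load = 0;
        for (int i = 0; i < THREADS; i++)
            if (i != t && placement[i] == n) load++;
        return 1.0 + 0.5 * load;
    }

    int main(void) {
        int placement[THREADS] = {0};   /* all threads start on node 0 */
        int changed = 1;

        /* Best-response dynamics: stop when no thread wants to move. */
        while (changed) {
            changed = 0;
            for (int t = 0; t < THREADS; t++) {
                int best = placement[t];
                for (int n = 0; n < NODES; n++)
                    if (node_cost(n, placement, t) <
                        node_cost(best, placement, t))
                        best = n;
                if (best != placement[t]) { placement[t] = best; changed = 1; }
            }
        }

        for (int t = 0; t < THREADS; t++)
            printf("thread %d -> node %d\n", t, placement[t]);
        return 0;
    }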