Joseph K. L. Lee’s research while affiliated with University of Edinburgh and other places


Publications (14)


Sustainable AI: Experiences, Challenges & Recommendations
  • Conference Paper

November 2024 · 2 Reads

Eleanor Broadway · Joseph K. L. Lee · Michèle Weiland


Pragma driven shared memory parallelism in Zig by supporting OpenMP loop directives

September 2024 · 15 Reads

The Zig programming language, which is designed to provide performance and safety as first class concerns, has become popular in recent years. Given that Zig is built upon LLVM, and so enjoys many of the benefits provided by that ecosystem, including access to a rich set of backends, Zig has significant potential for high performance workloads. However, it is yet to gain acceptance in HPC, and one of the reasons for this is that support for pragma-driven shared memory parallelism is missing. In this paper we describe enhancing the Zig compiler to add support for OpenMP loop directives, and then explore performance using NASA's NAS Parallel Benchmark (NPB) suite. We demonstrate that not only does our integration of OpenMP with Zig scale comparably to the Fortran and C reference implementations of NPB, but furthermore that Zig provides up to a 1.25 times performance increase compared to Fortran.



Implementing OpenMP for Zig to enable its use in HPC context

August 2024 · 17 Reads

This extended abstract explores supporting OpenMP in the Zig programming language. Whilst C and Fortran are currently the main languages used to implement HPC applications, Zig provides a similar level of performance complemented by several modern language features, such as enforcing memory safety. However, Zig lacks support for OpenMP, which is the de facto threaded programming technology. Leveraging Zig's LLVM compiler tooling, we have added partial support for OpenMP to the Zig compiler and demonstrated that the performance attained by using Zig with OpenMP is comparable to, and in some cases exceeds, that of conventional HPC languages. Consequently, we demonstrate that Zig is a viable and important programming technology for HPC, and this work paves the way for more HPC features to be added to Zig, ultimately providing HPC developers with the option of using a safer, more modern language for creating high performance applications.



2BP: 2-Stage Backpropagation
  • Preprint
  • File available

Figure: Average throughput using the 1F1B-1 pipelining schedule and 2BP, with and without concatenating micro-batches during the backward-p2 step.

May 2024 · 12 Reads

As Deep Neural Networks (DNNs) grow in size and complexity, they often exceed the memory capacity of a single accelerator, necessitating the sharding of model parameters across multiple accelerators. Pipeline parallelism is a commonly used sharding strategy for training large DNNs. However, current implementations of pipeline parallelism are being unintentionally bottlenecked by the automatic differentiation tools provided by ML frameworks. This paper introduces 2-stage backpropagation (2BP). By splitting the backward propagation step into two separate stages, we can reduce idle compute time. We tested 2BP on various model architectures and pipelining schedules, achieving increases in throughput in all cases. Using 2BP, we were able to achieve a 1.70x increase in throughput compared to traditional methods when training a LLaMa-like transformer with 7 billion parameters across 4 GPUs.

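The central mechanism described above can be sketched in a few lines of PyTorch: the backward pass of a pipeline stage is split so that the gradient with respect to the stage's input (which the previous stage is waiting on) is produced first, while the gradient with respect to the weights is deferred to fill otherwise idle compute time. The sketch below is an editor's illustration, not the authors' implementation; the toy stage module, the tensor shapes, and the use of torch.autograd.grad are assumptions made purely for demonstration.

    import torch

    # Toy pipeline stage and dummy activations (hypothetical, for illustration only).
    stage = torch.nn.Linear(16, 16)
    x = torch.randn(4, 16, requires_grad=True)  # activations received from the previous stage
    loss = stage(x).sum()                       # stand-in for the rest of the forward pass

    # First backward stage: gradient w.r.t. the stage input only,
    # so it can be sent upstream immediately.
    grad_input, = torch.autograd.grad(loss, x, retain_graph=True)

    # Second backward stage: gradient w.r.t. the weights, which can be
    # deferred until the pipeline schedule would otherwise leave the device idle.
    grad_weights = torch.autograd.grad(loss, list(stage.parameters()))

In a real pipeline schedule the second stage is interleaved with work on other micro-batches rather than run back-to-back as shown here.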


Is RISC-V ready for HPC prime-time: Evaluating the 64-core Sophon SG2042 RISC-V CPU

September 2023 · 772 Reads

The Sophon SG2042 is the world's first commodity 64-core RISC-V CPU for high performance workloads, and an important question is whether the SG2042 has the potential to encourage the HPC community to embrace RISC-V. In this paper we undertake a performance exploration of the SG2042 against existing RISC-V hardware and the high performance x86 CPUs in use by modern supercomputers. Leveraging the RAJAPerf benchmarking suite, we find that, on average, the SG2042 delivers between five and ten times the per-core performance of the nearest widely available RISC-V hardware. We also find that, on average, the x86 high performance CPUs under test outperform the SG2042 by between four and eight times for multi-threaded workloads, although some individual kernels do perform faster on the SG2042. The result of this work is a performance study that not only contrasts this new RISC-V CPU against existing technologies, but furthermore shares performance best practice.


Backporting RISC-V Vector Assembly

August 2023 · 15 Reads · 8 Citations

Lecture Notes in Computer Science

Leveraging vectorisation, the ability for a CPU to apply operations to multiple elements of data concurrently, is critical for high performance workloads. However, at the time of writing, commercially available physical RISC-V hardware that provides the RISC-V vector extension (RVV) only supports version 0.7.1, which is incompatible with the latest ratified version 1.0. The challenge is that upstream compiler toolchains, such as Clang, only target the ratified v1.0 and do not support the older v0.7.1. Because v1.0 is not compatible with v0.7.1, the only way to program vectorised code is to use a vendor-provided, older compiler. In this paper we introduce the rvv-rollback tool, which translates assembly code generated by the compiler using vector extension v1.0 instructions to v0.7.1. We use this tool to compare the vectorisation performance of the vendor-provided GNU 8.4 compiler (which supports v0.7.1) against LLVM 15.0 (which supports only v1.0), and find that the LLVM compiler is capable of auto-vectorising more computational kernels, and delivers greater performance than GNU in most, but not all, cases. We also tested LLVM vectorisation with vector-length-agnostic and vector-length-specific settings, and observed cases with significant differences in performance.

Keywords: RISC-V vector extension · HPC · Clang · RVV Rollback
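As a concrete illustration of this kind of rewriting, the sketch below shows a minimal, editor-written Python translator in the spirit of rvv-rollback. The two instruction mappings (vle32.v/vse32.v to vlw.v/vsw.v) and the removal of the tail/mask-agnostic flags from vsetvli are offered only as examples of plausible v1.0-to-v0.7.1 rules; they are not taken from, and do not claim to represent, the actual rule set of the rvv-rollback tool.

    import re

    # Example RVV 1.0 -> RVV 0.7.1 rewrites (illustrative, not exhaustive):
    #  - unit-stride 32-bit loads/stores: vle32.v/vse32.v -> vlw.v/vsw.v
    #  - v0.7.1 vsetvli takes no tail/mask agnostic (ta, ma) flags
    REWRITES = [
        (re.compile(r"\bvle32\.v\b"), "vlw.v"),
        (re.compile(r"\bvse32\.v\b"), "vsw.v"),
        (re.compile(r"(vsetvli\s+\S+,\s*\S+,\s*e\d+,\s*m\d+)\s*,\s*ta\s*,\s*ma"), r"\1"),
    ]

    def rollback(asm: str) -> str:
        """Rewrite RVV 1.0 assembly text into an approximate RVV 0.7.1 form."""
        for pattern, replacement in REWRITES:
            asm = pattern.sub(replacement, asm)
        return asm

    print(rollback("vsetvli t0, a0, e32, m1, ta, ma\nvle32.v v8, (a1)"))
    # vsetvli t0, a0, e32, m1
    # vlw.v v8, (a1)

A production tool would need a far larger rule table and whole-program context; the point here is only the overall flavour of assembly-to-assembly translation.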


Citations (4)


... The pseudocode for the optimized algorithm for a lower triangular non-transposed matrix is given in Algorithm 4. It traverses the matrix "bottom-up". For the last columns, calculations are performed using the baseline algorithm (lines 4-14), and for the remaining columns, they are performed by an optimized algorithm that traverses along diagonals (lines 15-31).

 6:     AXPY(length + 1, alpha * X[i], a + lda * i, Y + i);
 7:   end for
 8:   for (i = 0; i < iend; i += BLOCK_SIZE) do
 9:     y_copy = LOAD(y + i);
10:     for (INT j = 0; j < k; j++) do
11:       x_copy = LOAD(x + i + 1 + j);
12:       diag_a = LOAD_WITH_STRIDE(a + 1 + j, STRIDE);
13:       mul = MUL_VV(x_copy, diag_a);
14:       y_copy = FMA_VF(y_copy, alpha, mul);
15:     end for
16:     SAVE(y + i, y_copy);
17:     a += BLOCK_SIZE * lda;
18:   end for
19:   for (; i < n - k; i++) do
20:     length = MIN(n - i - 1, k);
21:     Y[i] += alpha * DOT(length, a + 1, X + i + 1);
22:     a += lda;
23:   end for
24:   for (; i < n; i++) do
25:     length = MIN(n - i - 1, k);
26:     AXPY(length + 1, alpha * X[i], a + k - length,
27:          Y + i - length);
28:     Y[i] += alpha * DOT(length, a + 1, X + i + 1);
29:     a += lda;
30:   end for
31: }

Depending on the stored triangle of the matrix and whether the matrix is transposed or not, this operation is performed using DOT or AXPY. ...

Reference:

Performance optimization of BLAS algorithms with band matrices for RISC-V processors
Is RISC-V ready for HPC prime-time: Evaluating the 64-core Sophon SG2042 RISC-V CPU
  • Citing Conference Paper
  • November 2023

... The work [12] examines existing and future RISC-V ISA extensions and demonstrates exciting prospects for the HPC area. The paper [34] provided an overview of the state-of-the-art in vectorization on RISC-V and analyzed the capabilities of Allwinner D1 processors. The papers [53] and [22] showed the advantages of using long SIMD vectors on promising RISC-V processors. ...

Test-Driving RISC-V Vector Hardware for HPC
  • Citing Chapter
  • August 2023

Lecture Notes in Computer Science

... Paper [7] contains a performance evaluation of HPC workloads using FPGA simulation. The paper [8] focuses on the important problem of insufficient development of compilers for RISC-V in terms of code vectorization. Indeed, current compilers only target the RVV 1.0 vector extensions, while most of the available devices support only RVV 0.7.1. ...

Backporting RISC-V Vector Assembly
  • Citing Chapter
  • August 2023

Lecture Notes in Computer Science

... A vital extension is RVV, which implements the SIMD (single instruction, multiple data) approach, allowing parallel processing of data in vector registers. RVV enables the efficient processing of large data arrays, resulting in considerable performance benefits for HPC (High Performance Computing) tasks [8] and artificial intelligence applications [9,10]. Notably, the non-fixed length of RVV's vector registers, determined by the VLEN parameter, simplifies development by allowing software to be written once and run on hardware with varying register lengths. ...

Test-driving RISC-V Vector hardware for HPC
  • Citing Preprint
  • April 2023