Sardar Anisul Haque
Geomechanica Inc. (Canada) & Qassim University (KSA)
PhD, University of Western Ontario
Skills and Expertise
Algorithms, Parallel and Distributed Computing, High Performance Computing, Parallel Programming, Scientific Computing, Parallel Processing, GPU Programming, Scientific Computation, GPU, Computational Science, GPU Computing, Scientific Programming, Code Development, Parallel Algorithms, Parallel & Distributed Systems, Data Intensive Computing, GPGPU, Compute Unified Device Architecture, Multicore and Manycore
- Toronto, Ontario, Canada
- Visiting Scientist at Geomechanica Inc.
Nov 2015 - Feb 2017
- High Performance Computing
- Toronto, Ontario, Canada
- Ontario Centre of Excellence Postdoctoral Fellow
- Using high performance computing techniques to solve problems in rock mechanics. Technologies used: OpenCL and C++.
Geomechanica Inc. (Canada) & Qassim University (KSA)
- Visiting Scientist at Geomechanica Inc. & Assistant Professor at Qassim University
This R&D project focuses on the development, implementation and validation of a 3D geomechanical simulation software based on the finite-discrete element method (FDEM). With this software, named Irazu, the parallel processing power of general-purpose graphics processing units (GPGPUs) is leveraged to gain significant computational performance boosts compared to existing sequential FDEM codes.
Research Items (14)
This paper presents a novel implementation of a hydro-mechanically coupled, finite-discrete element method (FDEM) optimized to exploit the computing parallelism of graphics processing units (GPUs). A co-processing approach is adopted with the control loop of FDEM executed serially on the CPU and compute-intensive tasks off-loaded to the GPU. A benchmarking study indicates speedups of up to 100× compared to sequential CPU execution. The implementation is validated by comparing 3D laboratory-scale rock fracturing simulations with experimental results. The effectiveness of the approach for practical rock engineering applications is demonstrated through the back analysis of a slope in a fractured rock mass.
- Jan 2018
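The co-processing split described above (a serial control loop on the CPU, compute-intensive tasks offloaded to the GPU) can be sketched in miniature. The following Python sketch is purely illustrative: `device_update` is a hypothetical stand-in for the GPU kernels, and the physics is reduced to a trivial explicit time step.

```python
import numpy as np

def device_update(pos, vel, force, dt):
    # Stand-in for a GPU kernel: in the real software, the compute-intensive
    # per-element work (contact detection, force integration) runs on the device.
    vel = vel + dt * force
    pos = pos + dt * vel
    return pos, vel

def simulate(n_steps, n_nodes, dt=1e-3):
    # Serial control loop on the host (CPU), mirroring the co-processing
    # approach: the loop itself stays sequential, only the heavy step is offloaded.
    pos = np.zeros(n_nodes)
    vel = np.zeros(n_nodes)
    force = np.ones(n_nodes)  # hypothetical constant load
    for _ in range(n_steps):
        pos, vel = device_update(pos, vel, force, dt)
    return pos

print(simulate(10, 4)[0])
```

In the real implementation the benefit comes from the offloaded step dominating the runtime, so the serial loop overhead is negligible.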
The CUDA Modular Polynomial (CUMODP) Library implements arithmetic operations for dense matrices and dense polynomials, primarily with modular integer coefficients; some operations are available for integer or floating point coefficients. Similar to other software libraries targeting Graphics Processing Units (GPUs), like CuBLAS¹, CUMODP focuses on efficiency-critical routines and provides them in the form of device functions and CUDA kernels. Hence, these routines are primarily designed to offer GPU support to polynomial system solvers. A bivariate system solver is part of the library, as a proof of concept. Its implementation is presented in , and it is integrated in Maple's Triangularize command² since release 18 of Maple.
- Jul 2017
- the International Workshop
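As a concrete illustration of the dense modular arithmetic such a library provides, here is a serial Python sketch of plain (schoolbook) polynomial multiplication over Z/pZ. It mirrors the kind of operation CUMODP parallelizes but is not the library's API; the function name is mine.

```python
def poly_mul_mod(a, b, p):
    # Plain product of dense polynomials a and b with coefficients mod p;
    # a[i] holds the coefficient of x^i.
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % p
    return c

# (x + 1)(x + 2) = x^2 + 3x + 2 over Z/7Z
print(poly_mul_mod([1, 1], [2, 1], 7))  # [2, 3, 1]
```

On a GPU, the inner products for different output coefficients are the natural unit of parallel work.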
We present multithreaded adaptations of the Euclidean plain division and Euclidean GCD algorithms to many-core GPU architectures. We report on an implementation with NVIDIA CUDA and a complexity analysis with an enhanced version of the PRAM model.
- Sep 2014
- International Workshop on Computer Algebra in Scientific Computing
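The serial algorithms being adapted can be sketched as follows. This is a plain Python reference over Z/pZ with p prime (coefficient lists, lowest degree first), not the CUDA implementation; the function names are mine.

```python
def poly_divmod_mod(a, b, p):
    # Euclidean (plain) division of a by b over Z/pZ, p prime.
    r = a[:]
    q = [0] * max(len(a) - len(b) + 1, 1)
    inv_lead = pow(b[-1], p - 2, p)  # inverse of the leading coefficient (Fermat)
    for k in range(len(a) - len(b), -1, -1):
        q[k] = (r[k + len(b) - 1] * inv_lead) % p
        for j, bj in enumerate(b):
            r[k + j] = (r[k + j] - q[k] * bj) % p
    while len(r) > 1 and r[-1] == 0:
        r.pop()
    return q, r

def poly_gcd_mod(a, b, p):
    # Euclidean GCD over Z/pZ, returned monic.
    while len(b) > 1 or b[0] != 0:
        _, rem = poly_divmod_mod(a, b, p)
        a, b = b, rem
    inv = pow(a[-1], p - 2, p)
    return [(c * inv) % p for c in a]

# gcd((x+1)(x+2), (x+1)(x+3)) = x + 1 over Z/7Z
print(poly_gcd_mod([2, 3, 1], [3, 4, 1], 7))  # [1, 1]
```

The GPU versions parallelize the coefficient updates inside each division step; the outer remainder sequence stays inherently sequential.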
We propose parallel algorithms for operations on univariate polynomials (multi-point evaluation, interpolation) based on subproduct tree techniques and targeting many-core GPUs. On those architectures, we demonstrate the importance of adaptive algorithms, in particular the combination of parallel plain arithmetic and parallel FFT-based arithmetic. Experimental results illustrate the benefits of our algorithms.
- Aug 2014
- International Congress on Mathematical Software
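The subproduct tree technique can be illustrated with a small serial sketch: split the evaluation points in half, reduce f modulo the subproduct of each half, and recurse. This realizes the tree top-down; all names are mine, and the real algorithms replace the plain arithmetic below with FFT-based arithmetic at large sizes.

```python
def pmul(a, b, p):
    # Plain product of dense polynomials over Z/pZ (a[i] is the x^i coefficient).
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % p
    return c

def pmod(a, b, p):
    # Remainder of a modulo a monic polynomial b over Z/pZ.
    r = a[:]
    while len(r) >= len(b):
        c = r[-1]
        for j, bj in enumerate(b):
            r[len(r) - len(b) + j] = (r[len(r) - len(b) + j] - c * bj) % p
        r.pop()
    return r or [0]

def eval_tree(f, points, p):
    # Multi-point evaluation via the subproduct tree, realized recursively.
    if len(points) == 1:
        return [pmod(f, [(-points[0]) % p, 1], p)[0]]
    mid = len(points) // 2
    left, right = points[:mid], points[mid:]
    ml, mr = [1], [1]
    for u in left:
        ml = pmul(ml, [(-u) % p, 1], p)   # subproduct of the left half
    for u in right:
        mr = pmul(mr, [(-u) % p, 1], p)   # subproduct of the right half
    return eval_tree(pmod(f, ml, p), left, p) + eval_tree(pmod(f, mr, p), right, p)

# f = x^2 + 1 at points 0, 1, 2, 3 over Z/7Z
print(eval_tree([1, 0, 1], [0, 1, 2, 3], 7))  # [1, 2, 5, 3]
```

The adaptive point made in the abstract is that, on GPUs, the crossover between plain and FFT-based arithmetic inside this recursion sits much higher than on CPUs.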
CUMODP is a CUDA library for exact computations with dense polynomials over finite fields. A variety of operations like multiplication, division, computation of subresultants, multi-point evaluation, interpolation and many others are provided. These routines are primarily designed to offer GPU support to polynomial system solvers and a bivariate system solver is part of the library. Algorithms combine FFT-based and plain arithmetic, while the implementation strategy emphasizes reducing parallelism overheads and optimizing hardware usage.
- Feb 2014
We present a model of multithreaded computation, combining fork-join and single-instruction-multiple-data parallelisms, with an emphasis on estimating the parallelism overheads of programs written for modern many-core architectures. We establish a Graham-Brent theorem for this model so as to estimate the execution time of programs running on a given number of streaming multiprocessors. We evaluate the benefits of our model with four fundamental algorithms from scientific computing. In each case, our model is used to minimize parallelism overheads by determining an appropriate value range for a given program parameter; moreover, experimentation confirms the model's predictions.
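For context, the classical Graham–Brent bound that such a theorem adapts states that a computation with total work $W$ and span (critical-path length) $S$ runs on $p$ processors in time

```latex
T_p \;\le\; \frac{W}{p} + S
```

The model described in the abstract refines this kind of estimate for streaming multiprocessors by additionally accounting for parallelism overheads; the exact statement is in the paper.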
The objective of high performance computing (HPC) is to ensure that the computational power of hardware resources is well utilized to solve a problem. Various techniques are employed to achieve this goal: improving algorithms to reduce the number of arithmetic operations, modifying data access patterns or rearranging data to reduce memory traffic, optimizing code at all levels, and designing parallel algorithms with smaller span or reduced overhead are some of the areas HPC researchers work on. In this thesis, we investigate HPC techniques for implementing basic routines in computer algebra, targeting hardware acceleration technologies. We start with a sorting algorithm and its application to sparse matrix-vector multiplication, focusing on cache complexity issues. Since basic routines in computer algebra often expose a lot of fine-grained parallelism, we then turn our attention to many-core architectures, on which we consider dense polynomial and matrix operations ranging from plain to fast arithmetic. Most of these operations are combined within a bivariate system solver running entirely on a graphics processing unit (GPU).
As with serial code on CPUs, parallel code on GPUs for dense polynomial arithmetic relies on a combination of asymptotically fast and plain algorithms, employed for data of large and small size, respectively. Both types of algorithms must be parallelized in order to achieve peak performance. In this paper, we show that plain dense polynomial multiplication can be efficiently parallelized on GPUs. Remarkably, it outperforms (highly optimized) FFT-based multiplication up to degree 2¹², while on CPUs the same threshold usually sits near degree 2⁶. We also report on a GPU implementation of the Euclidean algorithm which is both work-efficient and runs in linear time for input polynomials up to degree 2¹⁸, outperforming GCD implementations based on systolic arrays.
We report on a GPU implementation of the condensation method designed by Abdelmalek Salem and Kouachi Said for computing the determinant of a matrix. We consider two types of coefficients: modular integers and floating point numbers. We evaluate the performance of our code by measuring its effective bandwidth and argue that it is numerically stable in the floating point case. In addition, we compare our code with serial implementations of determinant computation from well-known mathematical packages. Our results suggest that a GPU implementation of the condensation method has large potential for improving those packages in terms of running time and numerical stability.
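I have not verified the exact Salem–Kouachi recurrence here, so the sketch below uses the closely related classical Chio condensation, which likewise shrinks an n × n matrix to (n−1) × (n−1) per step using 2 × 2 minors against a pivot; function names are mine and the code is a serial illustration, not the GPU kernel.

```python
def det_condensation(a):
    # Determinant by Chio's condensation: replace A with the matrix of
    # 2x2 minors against the pivot a[0][0], dividing out the pivot power.
    a = [row[:] for row in a]
    det = 1.0
    while len(a) > 1:
        n = len(a)
        if a[0][0] == 0:  # pivot by a row swap if needed
            for i in range(1, n):
                if a[i][0] != 0:
                    a[0], a[i] = a[i], a[0]
                    det = -det
                    break
            else:
                return 0.0  # the whole first column is zero
        piv = a[0][0]
        b = [[a[0][0] * a[i][j] - a[i][0] * a[0][j]
              for j in range(1, n)] for i in range(1, n)]
        det /= piv ** (n - 2)
        a = b
    return det * a[0][0]

print(det_condensation([[2.0, 1.0], [1.0, 2.0]]))  # 3.0
```

Each condensation step is a dense, uniform update over all entries, which is what makes the method attractive for GPU execution.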
Sparse matrix-vector multiplication, or SpMxV, is an important kernel in scientific computing; for example, it is the main computational step of the conjugate gradient method. Though the total number of arithmetic operations in SpMxV is fixed, reducing the probability of cache misses per operation remains a challenging area of research. In this work, we present a new column ordering algorithm for sparse matrices and analyze the cache complexity of SpMxV when the matrix is ordered by our technique. Numerical experiments with very large test matrices clearly demonstrate the performance gains rendered by the proposed technique.
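The kernel in question, in its simplest compressed sparse row (CSR) form, is only a few lines; the names below are mine. The indirect access `x[col_idx[k]]` is exactly where cache misses arise and what column ordering targets.

```python
def csr_spmv(vals, col_idx, row_ptr, x):
    # y = A @ x with A in CSR form: vals holds the nonzeros row by row,
    # col_idx their column indices, row_ptr the start of each row in vals.
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[col_idx[k]]  # irregular, cache-sensitive access
        y[i] = s
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]]
print(csr_spmv([1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3], [1.0, 1.0, 1.0]))  # [3.0, 3.0]
```

Reordering columns so that consecutive nonzeros hit nearby entries of x improves reuse of cached lines of x without changing the arithmetic.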
We revisit ordering techniques as a preprocessing step for improving the performance of sparse matrix-vector multiplication (SpM×V) on modern hierarchical-memory computers. In computing SpM×V, the main purpose of ordering columns (or rows) is to improve performance by enhancing data reuse. We present a new ordering technique based on the binary reflected Gray code and experimentally evaluate and compare it with other column ordering techniques from the literature. Results from numerical experiments with very large test matrices clearly demonstrate the performance gains rendered by the proposed technique.
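One plausible reading of a Gray-code-based ordering can be sketched as follows: encode each column's sparsity pattern as a bitmask of its nonzero rows, then sort columns by the position of that mask in the binary reflected Gray code sequence, so columns with similar patterns land next to each other. This is an illustrative guess at the idea, not the thesis's exact algorithm; all names are mine.

```python
def gray_rank(mask):
    # Index of a bit pattern in the binary reflected Gray code sequence
    # (inverse Gray code: b = m ^ (m >> 1) ^ (m >> 2) ^ ...).
    b = 0
    while mask:
        b ^= mask
        mask >>= 1
    return b

def brgc_column_order(columns):
    # columns[j] is the set of row indices where column j has a nonzero.
    def mask(col):
        m = 0
        for r in col:
            m |= 1 << r
        return m
    return sorted(range(len(columns)), key=lambda j: gray_rank(mask(columns[j])))

cols = [{0, 2}, {0, 1}, {2}]  # hypothetical sparsity patterns
print(brgc_column_order(cols))  # [1, 0, 2]
```

Adjacent positions in the Gray code sequence differ in one bit, so this ordering tends to place columns with nearly identical row patterns side by side, which is the data-reuse effect the abstract describes.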
The paper presents a genetic algorithm with fuzzy logic controller for determining opportunistic replacement policy for deteriorating components of an equipment or system. An opportunistic replacement model has been formulated by considering the dynamics of the decision process of such a policy. In order to reduce the computational burden involving complete enumeration of all possible policies, genetic algorithm has been used to find near optimal solution by maximizing net benefit to be gained from an opportunistic replacement. A fuzzy logic controller has been used to automatically adjust the fine-tuning structure of genetic algorithm parameters. The performance of the model and the solution procedure has been evaluated for a number of case problems, which clearly demonstrates that the proposed method is very effective.
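The combination described above can be sketched as a toy: a small genetic algorithm whose mutation rate is adjusted each generation by a crude fuzzy-style rule on population convergence. This is a generic illustration under my own assumptions, not the paper's model or controller; the fitness function stands in for the net-benefit objective.

```python
import random

def run_ga(fitness, n_genes, pop_size=30, generations=60, seed=0):
    # Toy GA maximizing `fitness`. A fuzzy-style rule adapts the mutation
    # rate from the population's fitness spread: a small spread ("converged")
    # raises mutation, a large spread lowers it.
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    mut_rate = 0.1
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        spread = fitness(scored[0]) - fitness(scored[-1])
        mut_rate = min(0.5, mut_rate * 1.5) if spread < 1e-3 else max(0.01, mut_rate * 0.9)
        parents = scored[:pop_size // 2]  # truncation selection
        pop = parents[:]                  # elitism: keep the top half
        while len(pop) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes) if n_genes > 1 else 0
            child = a[:cut] + b[cut:]     # one-point crossover
            child = [g + rng.gauss(0.0, mut_rate) if rng.random() < mut_rate else g
                     for g in child]
            pop.append(child)
    return max(pop, key=fitness)

# Toy "net benefit" surrogate: maximal when every gene equals 0.5.
best = run_ga(lambda xs: -sum((x - 0.5) ** 2 for x in xs), n_genes=3)
```

A real controller would use graded membership functions and rule tables rather than the single threshold used here.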
Abstract The efficiency of linear algebra operations for sparse matrices on modern high performance computing systems is often constrained by the available memory bandwidth. We are interested in sparse matrices whose sparsity pattern is unknown. In this thesis, we study the efficiency of the major storage schemes for sparse matrices during multiplication with a dense vector. A proper reordering of columns or rows usually results in reduced memory traffic due to improved data reuse. This thesis also proposes an efficient column ordering algorithm based on the binary reflected Gray code. Computational experiments show that this ordering results in increased performance in computing the product of a sparse matrix with a dense vector.
Summary Recent biological research has corroborated that gene sequence variants play a role in the development and progression of common diseases. Technological constraints prevent us from collecting haplotype data directly; instead we collect genotype data. To infer haplotype data from genotype data, Haplotype Inference by Pure Parsimony (HIPP), which minimizes the number of distinct haplotypes needed to explain a given set of genotypes, is a good option. HIPP can be reduced to an equivalent Boolean satisfiability (SAT) problem. The performance of this approach depends dramatically on the choice of branching rules and preprocessing steps. In this paper, we experiment with different combinations of preprocessing choices and branching rules of a SAT solver. We propose a solution to the HIPP problem based on this SAT model, implemented in a distributed environment. Keeping the complexity of the search problem in mind, we developed a SAT solver for a distributed environment. Finally, we tested a set of problem instances under combinations of six branching rules and three preprocessing steps to decide which variant of the SAT model is best for HIPP.