Journal of Physics: Conference Series
PAPER • OPEN ACCESS
Design of a High Performance Vector Processor Based on RISC-V Architecture
To cite this article: Xiaojing Han et al 2023 J. Phys.: Conf. Ser. 2560 012027
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
Design of a High Performance Vector Processor Based on RISC-V Architecture
Xiaojing Han1,a*, Liang Liu2,b, Zhe Zhang1,c, Yufeng Sun1,d, Jiahui Zhou1,e, Hao Cai3,f
1Chip Technology Department, Beijing Smartchip Microelectronics Technology CO.,
LTD, Beijing, China
2Digital Chip Design Center, Beijing Smartchip Microelectronics Technology CO.,
LTD, Beijing, China
3Information & Telecommunication Branch, State Grid Jiangsu Electric Power Co.,
Ltd, Nanjing, China
ahxj19870216@126.com; bliuliang@sgchip.sgcc.com.cn;
czhangzhe1@sgchip.sgcc.com.cn; dsunyufeng@sgchip.sgcc.com.cn;
ezhoujiahui@sgchip.sgcc.com.cn; f21123016@qq.com
* Corresponding author
Abstract. This paper proposes a high-performance Vector processor based on a high-performance embedded core named TS800. The TS800 is a 4-core processor based on the RISC-V architecture; it implements the IMAFDV instruction set and supports an L2 Cache, branch prediction, a sequential pipeline, and a dual-issue structure. A traditional CPU mainly supports Scalar calculations or, conversely, only Vector calculations, yet applications such as image and signal processing involve a large number of data-parallel computing operations. To solve the problem of low performance on parallel data calculations in industrial power applications, we propose adding a VPU hardware implementation to the TS800, enabling it to support the FFT algorithm and the underlying algorithm requirements of adaptive controllers, reinforcement learning, and learning-based methods. This paper presents the hardware realization of the VPU module: its sub-modules, the data flow between the processing units, and the control circuit. Large-area units such as floating-point arithmetic, multiplication, and division are multiplexed with the Scalar operators in the CPU, while the small control circuit is placed in the VPU-ALU. Arithmetic and logic operation instructions, shift operation instructions, comparison operation instructions, and permutation instructions are implemented in the VPU-ALU, which makes the overall design smaller in area and better in performance. The fir, fft, conv, matrix, Signal Converge, and variance tests prove that, when executing the same program, the running time of the CPU with only the Scalar unit is 1.44 to 9.55 times that of the CPU with the Vector module, which is sufficient to support the underlying algorithms of adaptive controllers.
Keywords. CPU, RISC-V, VPU, parallelism, performance
1. Introduction
RISC-V [1] is an open standard instruction set architecture (ISA) based on established RISC principles.
Unlike most other ISA designs, RISC-V [2] is provided under open source licenses that do not require
fees to use.
As the RISC-V ecosystem develops, it will be used more widely in embedded systems and in emerging fields such as IoT, edge computing, and artificial intelligence.
A traditional CPU mainly supports Scalar calculations or, conversely, only Vector calculations. However, applications such as image [3] and signal processing involve a large number of data-parallel computing operations. To solve the problem of low performance on parallel data calculations in industrial power applications, we propose adding a VPU hardware implementation to the TS800.
SIMD [4] technology initially split the data of 64-bit registers into multiple 8-bit, 16-bit, and 32-bit lanes to realize parallel computing on byte, half-word, and word data; later, to further increase computational parallelism, SIMD technology [5] met applications' demand for computing power by increasing the register bit width. Intel's MMX, SSE, and AVX series and ARM's Neon [7] architecture are all representatives of traditional SIMD technology [6].
Another way to improve data parallelism [8] is Vector computing technology. The V (Vector) extension of RISC-V is a streamlined Vector instruction set. Its vector length is flexible and variable, and it supports linear algebra operations, such as those in multimedia workloads, well. Vector computing technology decouples hardware and software more thoroughly and is friendlier to programmers. Comparing SIMD [9] with Vector computing, because a SIMD instruction set fixes the data operation bit width, each expansion of hardware parallelism [10] means an expansion of the instruction set and a rewriting of the code [11], which adds extra labor and slows development.
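As a hedged illustration of this difference, the following vector-length-agnostic loop (a sketch of our own in C, assuming a compiler with RVV 1.0 intrinsics support via <riscv_vector.h>; it is not code from this paper) asks the hardware at run time how many elements it can process per pass, so the same source keeps working unchanged when an implementation widens its vector unit:

```c
#include <stddef.h>
#include <stdint.h>
#include <riscv_vector.h>   /* assumed RVV 1.0 intrinsics support */

/* Vector-length-agnostic element-wise add: vsetvl returns how many
 * elements this hardware handles per pass, so no instruction-set
 * extension or code rewrite is needed when the vector unit grows. */
void vadd32(const int32_t *a, const int32_t *b, int32_t *c, size_t n) {
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m1(n);        /* hardware-chosen */
        vint32m1_t va = __riscv_vle32_v_i32m1(a, vl);
        vint32m1_t vb = __riscv_vle32_v_i32m1(b, vl);
        __riscv_vse32_v_i32m1(c, __riscv_vadd_vv_i32m1(va, vb, vl), vl);
        a += vl; b += vl; c += vl; n -= vl;
    }
}
```

A fixed-width SIMD version of the same loop hard-codes the lane count (four 32-bit lanes for SSE, eight for AVX), which is exactly the rewrite cost described above.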
Due to demand in the field of industrial control, we developed a CPU core based on the RISC-V architecture, named TS800. To solve the problem of low performance on parallel data calculations [12] in industrial power applications, we propose adding a Vector processor hardware implementation, named the VPU, to the TS800, which can support the underlying algorithm requirements of adaptive controllers, reinforcement learning, and learning-based methods [13]. The design supports FFT and IFFT of 64 to 4096 points [14].
Section 2 introduces the implementation of the VPU module and the design method for optimizing the area of the hardware circuit. Section 3 describes the verification of the VPU module, Section 4 presents its performance, and Section 5 concludes.
2. Vector processor implementation
2.1. TS800 architecture
The TS800 is a RISC-V processor that supports multi-core and multi-cluster configurations, fully supports the RISC-V Vector extension standard and the TS800 Intelligence Extensions, and is optimized for edge AI/ML/RL computing. The TS800 is a 4-core processor based on the RISC-V architecture; it implements the IMAFDV instruction set and supports an L2 Cache, branch prediction, a sequential pipeline, and a dual-issue structure. The TS800 is ideal for applications requiring high throughput and single-threaded performance, and for power-constrained applications such as AR/VR, sensor hubs, IVI systems, IP cameras, digital cameras, gaming rigs, etc. The TS800 architecture block diagram is shown in Figure 1; the red box marks the Vector processor part, named the VPU.
[Figure 1, not reproduced here, shows the TS800 block diagram: four cores (Core0~Core3) share the PLIC, Debug/Trace, Timers, PMU, CLK/RESET, and the L2 subsystem (L2 Cache, SCU, ACE master, ACP slave); each core contains the Scheduler, ALU, FPU, VPU, LSU, FB, L1 I Cache, L1 D Cache, and the WB & exception handler.]
Figure 1. Taishan 800 architecture
2.2. VPU architecture
The VPU module architecture block diagram is shown in Figure 2.
The VPU module consists of eight main functional modules, shown as blue boxes in the diagram: V_IB, V_DEC, V_LAU, V_LS, V_ARITH_CTRL[0]~V_ARITH_CTRL[3], V_ALU[0]~V_ALU[3], V_WB, and V_VRF.
The interactive interfaces, shown as orange boxes, are as follows: a signal interface with the Scheduler module, a signal interface with the integer ALU (int ALU) module, a signal interface with the floating-point FPU module, a signal interface with the LSU module, and a write-back signal interface to the WB module.
The data flow interaction with the other modules of the main pipeline is as follows. For integer multiplication and division operations, because the underlying operation units are multiplexed with the integer pipeline, V_ALU outputs control signals and data to the int ALU module. After the operation completes, the int ALU returns the result to V_ALU, which processes it and then sends it to the WB module.
Figure 2. VPU module architecture
The function, data flow and control circuit of each module are as follows:
V_IB is the Vector instruction buffer module. It receives Vector instructions from the main pipeline's scheduling module and records and maintains the lock/unlock, valid/invalid, and PC information of Vector instructions.
V_DEC is the Vector instruction decoding module. It decodes the detailed information of each Vector instruction.
V_LAU is the Vector instruction scheduling module. It schedules operation instructions according to the information in the WB register, the state of the Vector memory-access module V_LS, the busy/idle status of the three operation channels (the ARITH_CTRL modules), and the classification of the operation instructions.
V_LS is the Vector memory-access control module. It splits each memory-access instruction according to the input CSR and element-mask information, reading or writing up to 128 bits of data at a time; a complete memory-access instruction finishes after multiple memory accesses.
V_ARITH_CTRL is the Vector operation instruction execution control module. It splits the operation to be performed by the current Vector instruction according to the configured CSRs and the one-pass operation capability. For each operation, it reads the corresponding VRF, analyzes the specific operation type, and sends the operation request and operands to the operation unit V_ALU.
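As a rough software model of the splitting performed by V_LS (a sketch of our own, not the RTL; the unit-stride case and all names are assumptions):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of V_LS splitting one unit-stride Vector load:
 * vl elements of sew_bytes each are moved with at most 128 bits
 * (16 bytes) per memory transaction, matching the text above. */
void vls_split_load(uint8_t *vreg, const uint8_t *mem,
                    size_t vl, size_t sew_bytes) {
    const size_t beat = 128 / 8;               /* 128-bit access width */
    size_t total = vl * sew_bytes;             /* total bytes to move  */
    for (size_t off = 0; off < total; off += beat) {
        size_t chunk = (total - off < beat) ? (total - off) : beat;
        memcpy(vreg + off, mem + off, chunk);  /* one memory access    */
    }
}
```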
V_ALU is the Vector arithmetic logic operation control module. The fixed- and floating-point arithmetic logic operations of a whole Vector are completed by multiple modules: fixed-point multiplication and division are completed in the ALU of the fixed-point pipeline, other fixed-point operations are completed in V_ALU, and floating-point operations are completed in the ALU of the floating-point pipeline. V_ALU provides the centralized control and scheduling of these ALUs, as well as the calculation function of the ALU inside the Vector unit.
V_WB is the Vector write-back module. It receives the operation results output by V_ALU, processes the memory-access and write-back requests of V_LS, and drives the four write channels of the Vector register file according to the MASK and tail-write strategy of the current instruction. According to the write response and the execution status of the current instruction, it completes the retirement of the instruction.
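A minimal sketch of such a masked, tail-aware write (our own model, assuming a tail-undisturbed policy, which the text does not specify) is:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one V_WB write channel: active elements are
 * written only where the mask is set; masked-off and tail elements
 * (i >= vl) are left undisturbed under the assumed write policy. */
void vwb_write(int32_t *vreg, const int32_t *result,
               const bool *mask, size_t vl, size_t vlmax) {
    for (size_t i = 0; i < vlmax; i++) {
        if (i < vl && mask[i])
            vreg[i] = result[i];    /* active, unmasked element */
        /* else: masked-off or tail element stays unchanged */
    }
}
```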
V_VRF is the Vector register file module. It performs writes to the Vector registers by receiving the four write channels of V_WB, and performs reads of the Vector register file through the eight read channels in V_ARITH_CTRL and V_LS.
For floating-point operations, the underlying arithmetic is implemented by the FPU. V_ALU sends data and control signals to the FPU; after the operation completes, the FPU returns the result to V_ALU, which processes it and then sends it to the WB module. Finally, the instruction graduates in the main pipeline.
2.3. Vector instructions categories
Vector instructions are divided into three categories: configuration, operation, and memory access. The operation and memory-access instructions are implemented in the VPU, while the configuration (vset) instructions are implemented in the main pipeline. These three classes are executed serially.
Vector instructions flow through the fixed-point pipeline and are retired at its write-back stage. The values of the fixed- and floating-point general-purpose registers required by a Vector instruction are provided by the main pipeline when the instruction is issued and serve as inputs to the module.
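To illustrate the three categories (a sketch of our own using the RVV 1.0 C intrinsics, which we assume the toolchain supports; the kernel itself is hypothetical, not this paper's code):

```c
#include <stddef.h>
#include <riscv_vector.h>   /* assumed RVV 1.0 intrinsics support */

/* One stripmine step touches all three instruction categories:
 * vsetvl (configuration, main pipeline), vle/vse (memory access, VPU),
 * and vfmul (operation, VPU). */
void scale(const float *x, float *y, float k, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = __riscv_vsetvl_e32m1(n - i);            /* configuration */
        vfloat32m1_t v = __riscv_vle32_v_f32m1(&x[i], vl);  /* memory access */
        v = __riscv_vfmul_vf_f32m1(v, k, vl);               /* operation     */
        __riscv_vse32_v_f32m1(&y[i], v, vl);                /* memory access */
        i += vl;
    }
}
```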
2.4. VPU module area optimization scheme
To optimize the area and performance of the VPU hardware circuit, this paper implements circuit multiplexing in the VPU. Large-area units such as floating-point arithmetic, multiplication, and division are multiplexed with the Scalar operators in the CPU, while the small control circuit is placed in the VPU-ALU. Arithmetic and logic operation instructions, shift operation instructions, comparison operation instructions, and permutation instructions are implemented in the VPU-ALU, which makes the overall design smaller in area and better in performance.
Operation instructions can be executed in parallel in the Vector processor, while consecutive memory-access instructions are serialized.
By multiplexing the FPU and int ALU circuits, the area can be reduced by 19.3%, as shown in Table 1.

Table 1. VPU area compared with the optimized design

modules    percentage of total area    optimized percentage of total area
VPU        17.70%                      17.70%
FPU        13.90%                      0%
int ALU    15.40%                      0%
total      47.00%                      17.70%
3. Verification
The verification is based on a UVM verification platform. Result checking adopts a reference model and scoreboard, and every test case includes an explicit self-check. Using the CDV (coverage-driven verification) methodology, directed cases were added on top of random cases to achieve 100% functional coverage and code coverage.
Vector instructions can be categorized as follows: arithmetic operation instructions, logic operation
instructions, shift operation instructions, comparison operation instructions, permutation instructions
and transfer instructions between registers and DRAM.
The performance of the CPU with only the Scalar operation unit and that of the CPU with the Vector unit are compared below. The fir execution process and simulation results illustrate the arithmetic instructions as an example. With Scalar instructions, there are no parallel execution units inside the controller; the Vector test contains both Scalar and Vector instructions. The simulation results in this section were obtained by simulating fir programs with VCS.
3.1. Scalar instruction simulation
The test code is shown in Figure 3.
Figure 3. Scalar testing codes
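Figure 3 is not reproduced here. As a hedged stand-in, a Scalar FIR kernel of the kind such a test typically exercises might look like the following (function and variable names are hypothetical, not the paper's actual code):

```c
#include <stddef.h>

/* Hypothetical scalar FIR kernel: every multiply-accumulate
 * executes one element at a time on the Scalar pipeline. */
void fir_scalar(const float *x, const float *h, float *y,
                size_t n, size_t taps) {
    for (size_t i = 0; i + taps <= n; i++) {   /* assumes n >= taps */
        float acc = 0.0f;
        for (size_t j = 0; j < taps; j++)
            acc += h[j] * x[i + j];            /* one MAC per iteration */
        y[i] = acc;
    }
}
```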
Simulation results are shown in Figure 4.
Figure 4. Simulation results of Scalar instructions
3.2. Vector and Scalar instruction simulation
The Vector-and-Scalar test code is shown in Figure 5.
Figure 5. Vector and Scalar testing codes
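Figure 5 is likewise not reproduced. A hedged sketch of a vectorized FIR kernel in the same spirit, again assuming RVV 1.0 intrinsics support and using hypothetical names, is:

```c
#include <stddef.h>
#include <riscv_vector.h>   /* assumed RVV 1.0 intrinsics support */

/* Hypothetical vectorized FIR kernel: vl outputs are computed at a
 * time; the tap loop broadcasts each coefficient with vfmacc. */
void fir_vector(const float *x, const float *h, float *y,
                size_t n, size_t taps) {
    size_t out = n - taps + 1;                 /* assumes n >= taps */
    for (size_t i = 0; i < out; ) {
        size_t vl = __riscv_vsetvl_e32m8(out - i);
        vfloat32m8_t acc = __riscv_vfmv_v_f_f32m8(0.0f, vl);
        for (size_t j = 0; j < taps; j++) {
            vfloat32m8_t vx = __riscv_vle32_v_f32m8(&x[i + j], vl);
            acc = __riscv_vfmacc_vf_f32m8(acc, h[j], vx, vl);
        }
        __riscv_vse32_v_f32m8(&y[i], acc, vl);
        i += vl;
    }
}
```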
Simulation results of Vector and Scalar instruction are shown in Figure 6.
Figure 6. Simulation results of Vector and Scalar instructions
4. Performance
4.1. TS800 timing optimization results
The fir, fft, conv, matrix, Signal Converge, and variance tests prove that, when executing the same program, the running time of the CPU with only the Scalar unit is 1.44 to 9.55 times that of the CPU with the Vector module, as shown in Table 2.
Table 2. Parallel processing clock cycles

arithmetic        Vector (clk cycles)    no Vector / Vector (ratio)
fir               157,879                4.90
fft               567,172                2.92
conv              4,005,817              1.86
matrix            337,254                1.44
Signal Converge   971,422                9.55
variance          13,587                 1.75
4.2. TS800 area optimization results
The layout of the chip is shown in Figure 7. The red block is the VPU; it occupies 17.7% of the whole chip.
Figure 7. Layout of the TS800 chip
The area of the TS800 is shown in Table 3.
Table 3. TS800 area results

modules               sub-modules     area        % of total area    % of sub-module
m800                  /               719544.4    100                /
FPU                   ALL             100007.5    13.9               100
                      fpu_alu0        10936.2     1.5                10.8
                      fpu_alu1        10943       1.5                10.8
                      fpu_div0        9882.5      1.4                10.1
                      fpu_div1        9731.7      1.4                10.1
                      fpu_mul0        14654.1     2                  14.4
                      fpu_mul1        14451.2     2                  14.4
                      fpu_regbank     21057.1     2.9                20.9
VPU                   ALL             127560.1    17.7               100
                      vpu_valu0       7046.1      1                  5.6
                      vpu_valu1       13586       1.9                10.7
                      vpu_valu2       17005.5     2.4                13.6
                      vpu_varthctl0   5346.9      0.7                4
                      vpu_varthctl1   6079.3      0.8                4.5
                      vpu_varthctl2   12427       1.7                9.6
                      vpu_vdec        1023        0.1                0.5
                      vpu_vib         2747.3      0.4                2.3
                      vpu_vlau        736.7       0.1                0.5
                      vpu_vlsu        14188.3     2                  11.2
                      vpu_vwb         46571.3     6.5                36.7
int_ALU               ALL             111124.6    15.4               100
                      mul0            14736       2                  13.26
                      mul1            14557       2                  13.1
                      reg             13005       1.8                11.7
                      div0            24349       3.38               21.91
                      div1            24229       3.37               21.8
                      others          20335.8     2.83               18.3
bht_bim_ram0          /               886.4       0.1                /
bht_bim_ram1          /               886.4       0.1                /
bht_meta_ram0         /               886.4       0.1                /
bht_meta_ram1         /               886.4       0.1                /
bht_ram0              /               1616.1      0.2                /
bht_ram1              /               1616.1      0.2                /
genblk1_u_data_ram0   /               51545       7.2                /
genblk1_u_data_ram1   /               51545       7.2                /
genblk1_u_tag_ram0    /               3398.4      0.5                /
genblk1_u_tag_ram1    /               3397        0.5                /
bpu                   /               8188.7      1.1                /
cpu_reset             /               24.8        0                  /
fu                    /               16350.4     2.3                /
intrpp                /               101.2       0                  /
lsu                   /               203042.9    28.2               /
sch                   /               10596.8     1.5                /
ts800_trig_top        /               7280.6      1                  /
wb_top                /               16828.9     2.3                /
To optimize the area and performance of the VPU hardware circuit, this paper implements circuit multiplexing in the VPU. A comparison with the optimized area is shown in Table 1: by multiplexing the FPU and int ALU circuits, the area can be reduced by 19.3%.
5. Conclusions
For applications such as image and signal processing, there are a large number of data-parallel computing operations. To solve the problem of low performance on parallel data calculations in industrial power applications, we proposed adding a VPU hardware implementation to the TS800. Taking the key technology of data-parallel architecture as the research object, with parallel computing as the technical support, the VPU hardware circuit requirements of the TS800 were put forward. The design was realized in SystemVerilog and verified in a UVM environment, and application data comparisons were performed on the hardware accelerator. The simulation and verification results show that the design of a VPU based on the RISC-V architecture is logically feasible. By multiplexing the FPU and int ALU circuits, the area can be reduced by 19.3% (Table 1). The fir, fft, conv, matrix, Signal Converge, and variance tests prove that, when executing the same program, the running time of the CPU with only the Scalar unit is 1.44 to 9.55 times that of the CPU with the Vector module. Further research will be conducted on processor pipelining, which is expected to greatly improve the data communication performance of the VPU's parallel computing.
Acknowledgments
This research was supported by the science and technology project of State Grid Corporation of China,
Research on Key Technology of High Performance of Embedded CPU Core (No. 5700-202141449A-0-
0-00).
References
[1] Yumu Wang, Zhiming Pan, Pengfei Wu, Wei Fu, Lelan Tian, Guirun Li, Yiqun Sun, "Porting Optimization of Yolov3 Based on RISC-V Vector Instruction Set", Microcontrollers & Embedded Systems, December 2021, 30-35, 40.
[2] Sheng Liu, Bo Yuan, Yang Guo, Haiyan Sun, Zekun Jiang, "Vector Memory-Access Shuffle Fused Instructions for FFT-like Algorithms", Chinese Journal of Electronics, January 2023, 1-12.
[3] Yuluo Guo, Haodong Bian, Runting Dong, Jiahao Tang, Xiaoying Wang, Jianqiang Huang, "Parallel Fourier Space Image Similarity Calculation Based on SIMD", Computer Engineering, November 2021, 253-259.
[4] Guang Wang and Xiangjun Li, "A Design of Reconfigurable Bus Based on the Embedded Microprocessor SIMD Core", IEEE 4th International Conference on Software Engineering and Service Science, May 2013, 740-741.
[5] Haiyan Chen, Chao Yang, Sheng Liu, Zhong Liu, "An Efficient SIMD Parallel Storage Structure Oriented to Radix-2 FFT Algorithm", Journal of Electronics, 2016, 3-8.
[6] Yongjiu Feng, Xinjun Chen, Feng Gao, Yang Liu, "Impacts of changing scale on Getis-Ord Gi* hotspots of CPUE: a case study of the neon flying squid (Ommastrephes bartramii) in the northwest Pacific Ocean", Acta Oceanologica Sinica, May 2018, 71-80.
[7] Kaixuan Zhang, Li Ding, Yujie Cai, Wenbo Yin, Fan Yang, Jun Tao and Lingli Wang, "A High Performance Real-Time Edge Detection System with NEON", IEEE 12th International Conference on ASIC, October 2017.
[8] Youyao Liu, Zhongwei Zhang, "Design of instruction level parallel structure based on SIMD architecture", Electronic Design Engineering, November 2017, 152-156.
[9] Fengjiao Li, Naijie Gu, Dongsheng Qi, Junjie Su, "Vectorization Study for FFT Algorithm Based on ARM SVE", Journal of Chinese Computer Systems, October 2022, 3-7.
[10] Cunyang Wei, Haipeng Jia, Yunquan Zhang, Guoyuan Qu, Dazhou Wei, Guangting Zhang, "Implementation and optimization of image processing algorithms based on ARMv8 CPUs", Computer Engineering & Sciences, October 2022, 1711-1720.
[11] Wei Gao, Yingying Li, Huihui Sun, Yanbing Li, Rongcai Zhao, "An Improved Control Flow SIMD Vectorization Method", Journal of Software, 2017, 28(8):18.
[12] Bingchao Li, Jizeng Wei, Wei Guo, Jizhou Sun, "Improving SIMD Utilization with Thread-Lane Shuffled Compaction in GPGPU", Chinese Journal of Electronics, April 2015, 22-26.
[13] M. Rhu and M. Erez, "The dual-path execution model for efficient GPU control flow", Proc. of International Symposium on High Performance Computer Architecture, Shenzhen, China, 2013, 591-602.
[14] Aniruddha S. Vaidya, Anahita Shayesteh, Dong Hyuk Woo, et al., "SIMD divergence optimization through intra-warp compaction", Proc. of International Symposium on Computer Architecture, Tel-Aviv, Israel, 2013, 368-379.