Journal of Physics: Conference Series
PAPER • OPEN ACCESS
Design of a High Performance Vector Processor Based on RISC-V Architecture
To cite this article: Xiaojing Han et al 2023 J. Phys.: Conf. Ser. 2560 012027
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
IPEC2023
Journal of Physics: Conference Series 2560 (2023) 012027
IOP Publishing
doi:10.1088/1742-6596/2560/1/012027
Design of a High Performance Vector Processor Based on
RISC-V Architecture
Xiaojing Han1,a*, Liang Liu2,b, Zhe Zhang1,c, Yufeng Sun1,d, Jiahui Zhou1,e, Hao
Cai3,f
1Chip Technology Department, Beijing Smartchip Microelectronics Technology CO.,
LTD, Beijing, China
2Digital Chip Design Center, Beijing Smartchip Microelectronics Technology CO.,
LTD, Beijing, China
3Information & Telecommunication Branch, State Grid Jiangsu Electric Power Co.,
Ltd, Nanjing, China
a hxj19870216@126.com; b liuliang@sgchip.sgcc.com.cn;
c zhangzhe1@sgchip.sgcc.com.cn; d sunyufeng@sgchip.sgcc.com.cn;
e zhoujiahui@sgchip.sgcc.com.cn; f 21123016@qq.com
* Corresponding author
Abstract. This paper proposes a high-performance Vector processor built on the high-performance embedded core TS800. The TS800 is a 4-core processor based on the RISC-V architecture; it implements the IMAFDV instruction set and supports an L2 Cache, branch prediction, a sequential pipeline, and a dual-issue structure. A traditional CPU mainly supports Scalar calculations, or supports only Vector calculations. Applications such as image and signal processing, however, involve a large number of data-parallel operations. To solve the problem of low performance of parallel data calculations in industrial power applications, we add a hardware Vector processor (VPU) to the TS800. The TS800 can then support the FFT algorithm and the underlying algorithms required by adaptive controllers, reinforcement learning, and other learning-based methods. This paper presents the hardware realization of the VPU module: its submodules, the data flow between the processing units, and the control circuit. Large-area units such as floating-point arithmetic, multiplication, and division are multiplexed with the Scalar operators in the CPU, while only their control circuit, which is small, is placed in the VPU-ALU. Arithmetic and logic, shift, comparison, and permutation instructions are implemented in the VPU-ALU itself, which makes the overall design area smaller and the performance better. Tests with the fir, fft, conv, matrix, Signal Converge, and variance programs show that, for the same program, the running time of the CPU with only Scalar units is 1.44 to 9.55 times that of the CPU with the Vector module, so the design can support the underlying algorithms of the adaptive controller.
Keywords. CPU, RISC-V, VPU, parallelism, performance
1. Introduction
RISC-V [1] is an open standard instruction set architecture (ISA) based on established RISC principles.
Unlike most other ISA designs, RISC-V [2] is provided under open source licenses that do not require
fees to use.
As the RISC-V ecosystem develops, it will be used more widely in embedded systems and in emerging
fields such as IoT, edge computing, and artificial intelligence.
A traditional CPU mainly supports Scalar calculations, or supports only Vector calculations.
Applications such as image [3] and signal processing, however, involve a large number of data-parallel
operations. To solve the problem of low performance of parallel data calculations in
industrial power applications, we add a hardware VPU to the TS800.
SIMD [4] technology initially split the data of a 64-bit register into multiple 8-bit, 16-bit, or 32-bit
lanes to process byte, half-word, and word data in parallel. Later, to further increase the
parallelism of computing, SIMD technology [5] met applications' demand for computing power by
increasing the bit width of the registers. Representatives of traditional SIMD technology [6] include
Intel's MMX, SSE series, and AVX series, and ARM's Neon [7] architecture.
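The lane-splitting idea can be illustrated with a small Python model. This is only a sketch: the function name `packed_add` and the choice of wrap-around addition are assumptions made for illustration, not a description of any particular SIMD instruction set.

```python
def packed_add(a: int, b: int, lane_bits: int, reg_bits: int = 64) -> int:
    """Add two registers lane by lane; carries never cross a lane boundary."""
    if reg_bits % lane_bits != 0:
        raise ValueError("lane width must divide the register width")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, reg_bits, lane_bits):
        lane_a = (a >> shift) & lane_mask
        lane_b = (b >> shift) & lane_mask
        # Wrap-around (modular) addition inside the lane.
        result |= ((lane_a + lane_b) & lane_mask) << shift
    return result


# The same register contents give different results for different lane widths.
as_bytes = packed_add(0x00FF0001, 0x00010001, lane_bits=8)
as_halfwords = packed_add(0x00FF0001, 0x00010001, lane_bits=16)
```

With 8-bit lanes the byte holding 0xFF wraps to zero without disturbing its neighbour, whereas with 16-bit lanes the same bits are treated as one half-word and the carry propagates within it.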
Another way to improve data parallelism [8] is Vector computing technology. The V (Vector)
extension of RISC-V is a simplified Vector instruction set. Its vector length is flexible and
variable, and it is well suited to multimedia and linear algebra workloads.
Vector computing decouples hardware from software to a greater degree and is friendlier
to programmers. Comparing SIMD [9] with Vector, we can see that because a SIMD
instruction set fixes the data operation bit width, each expansion of the hardware
parallelism [10] means expanding the instruction set and rewriting the code [11], which adds
extra labor and slows development.
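A minimal sketch of why variable-length Vector code does not need rewriting: the loop asks the hardware how many elements it may process per pass and adapts. The helper `vsetvl` below is a simplified stand-in (it always grants `min(requested, vlmax)`), and `vlmax` is a hypothetical parameter representing the hardware's maximum vector length; neither is taken from the TS800 design.

```python
def vsetvl(avl: int, vlmax: int) -> int:
    """Simplified model: grant min(requested elements, hardware maximum)."""
    return min(avl, vlmax)


def vector_add(x, y, vlmax):
    """Element-wise add written once, independent of the hardware vector length."""
    out = []
    i, n = 0, len(x)
    while i < n:
        vl = vsetvl(n - i, vlmax)  # elements handled in this pass
        out.extend(a + b for a, b in zip(x[i:i + vl], y[i:i + vl]))
        i += vl
    return out


# The same code gives the same result on "hardware" with different lengths.
narrow = vector_add(list(range(10)), [1] * 10, vlmax=4)
wide = vector_add(list(range(10)), [1] * 10, vlmax=16)
```

A fixed-width SIMD version would instead hard-code the lane count in the instruction choice, which is the rewriting cost the paragraph above refers to.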
Driven by demand in the field of industrial control, we developed a CPU core based on the RISC-V
architecture, named TS800. To solve the problem of low performance of parallel data calculations [12]
in industrial power applications, we add a hardware Vector processor, named VPU, to the TS800.
It can support the underlying algorithms required by adaptive controllers, reinforcement learning,
and other learning-based [13] methods. The design supports FFT and IFT of 64 to 4096 points
[14].
Section 2 introduces the implementation of the VPU module and the design method used to
optimize the hardware circuit area. Section 3 introduces the verification of the VPU
module. Section 4 presents the performance of the VPU module. Section 5 gives the
conclusions.
2. Vector processor implementation
2.1. TS800 architecture
The TS800 is a RISC-V processor that supports multi-core and multi-cluster configurations, fully
supports the RISC-V Vector extension standard and the TS800 Intelligence Extensions, and is
optimized for edge AI/ML/RL computing. It is a 4-core processor that implements the IMAFDV
instruction set and supports an L2 Cache, branch prediction, a sequential pipeline, and a dual-issue
structure. The TS800 is well suited to applications requiring high-throughput, single-threaded
performance and to power-constrained applications such as AR/VR, sensor hubs, IVI systems, IP
cameras, digital cameras, and gaming rigs. The TS800 architecture block diagram is shown in
Figure 1; the red box is the Vector processor part, named VPU.
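[Figure 1: block diagram of the TS800. Labels recoverable from the diagram: Core0 to Core3; PLIC; Debug, Trace; Timers; PMU; core-level blocks VPU, LSU, FB, L1 I Cache, L1 D Cache, Scheduler, ALU, FPU, WB & Exception handler; L2 blocks L2 Cache, SCU, ACE master, ACP slave; CLK; RESET.]
Figure 1. Taishan 800 architecture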
2.2. VPU architecture
The VPU module architecture block diagram is shown in Figure 2.
The VPU module consists of 8 main functional modules, shown as blue boxes in the diagram:
V_IB, V_DEC, V_LAU, V_LS, V_ARITH_CTRL[0]~V_ARITH_CTRL[3], V_ALU[0]~V_ALU[3], V_WB,
and V_VRF.
The interfaces to the rest of the core, shown as orange boxes, are the signal interface with the
SCHEDULE module, the signal interface with the integer ALU (int ALU) module, the signal interface
with the floating-point FPU module, the signal interface with the LSU module, and the write-back
signal interface to the WB module.
The data flow to and from the other modules of the main pipeline is as follows. For integer
multiplication and division, the underlying operation units are shared with the integer pipeline,
so V_ALU sends control signals and data to the int ALU module. After the operation is
completed, int ALU returns the result to V_ALU, which processes it and then sends it to the
WB module.
Figure 2. VPU module architecture
The function, data flow and control circuit of each module are as follows:
V_IB is the Vector instruction buffer module. It receives Vector instructions from the
scheduling module of the main pipeline, and records and maintains the lock/unlock state, the
valid/invalid state, and the PC information of each Vector instruction.
V_DEC is the Vector instruction decoding module. It decodes the detailed information of each
Vector instruction.
V_LAU is the Vector instruction scheduling module. It schedules operation instructions according
to the information input from the WB register and the Vector memory access module V_LS, the
busy or idle state of the three operation channels (ARITH_CTRL modules), and the classification
of the operation instructions.
V_LS is the Vector memory access control module. It splits the current memory access
instruction according to the input CSR and element mask information, and reads or writes up to
128 bits of data at a time. A complete memory access instruction finishes after multiple memory
accesses.
V_ARITH_CTRL is the Vector operation instruction execution control module. It splits the
operation of the current Vector instruction according to the configured CSR and the amount of
work that can be done in one pass. For each pass it reads the corresponding VRF entries,
determines the specific operation type, and sends the operation request and operands to the
operation unit V_ALU.
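The splitting performed by V_LS and V_ARITH_CTRL can be pictured with a short Python model. This is a sketch under stated assumptions, not the RTL: the 128-bit width comes from the text, while the function name `split_into_beats`, the parameters `vl` (active vector length) and `sew_bits` (element width), and the representation of the mask as a list are all hypothetical.

```python
DATAPATH_BITS = 128  # per-access width stated in the text


def split_into_beats(vl: int, sew_bits: int, mask=None):
    """Split one Vector instruction into passes ("beats") of at most 128 bits.

    Each beat is returned as the list of element indices that are active,
    i.e. below vl and not masked off.
    """
    per_beat = DATAPATH_BITS // sew_bits
    beats = []
    for start in range(0, vl, per_beat):
        indices = range(start, min(start + per_beat, vl))
        beats.append([i for i in indices if mask is None or mask[i]])
    return beats


# Ten 32-bit elements need three passes: 4 + 4 + 2 elements.
unmasked = split_into_beats(vl=10, sew_bits=32)
masked = split_into_beats(vl=6, sew_bits=32, mask=[1, 0, 1, 1, 0, 0])
```

The model keeps beats whose elements are all masked off; whether the hardware skips such passes is not stated in the text.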
V_ALU is the Vector arithmetic logic operation control module. The fixed-point and floating-point
arithmetic and logic operations of the Vector unit are completed by several modules. Fixed-point
multiplication and division are completed in the ALU of the fixed-point pipeline. Other fixed-point
operations are completed in V_ALU. Floating-point operations are completed in the ALU of the
floating-point pipeline. V_ALU provides centralized control and scheduling of these ALUs, as
well as the calculation function of the ALU inside the Vector unit.
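The routing rule described above can be summarized in a few lines of Python. The sketch assumes a hypothetical operation naming scheme (`"add"`, `"mul"`, and so on) and a boolean `is_float` flag; only the mapping from operation class to execution unit is taken from the text.

```python
from enum import Enum


class Unit(Enum):
    V_ALU = "V_ALU (inside the VPU)"
    INT_ALU = "int ALU (shared with the fixed-point pipeline)"
    FPU = "FPU (shared with the floating-point pipeline)"


def select_unit(op: str, is_float: bool) -> Unit:
    """Pick the unit that computes one Vector operation."""
    if is_float:
        return Unit.FPU          # all floating-point work is borrowed
    if op in {"mul", "div"}:
        return Unit.INT_ALU      # large fixed-point units are borrowed
    return Unit.V_ALU            # add, logic, shift, compare, permute


routing = {op: select_unit(op, is_float=False) for op in ("add", "sll", "mul", "div")}
```

In every case V_ALU remains the controller: it forwards operands to the borrowed unit and receives the result before write-back.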
V_WB is the Vector write-back module. It receives the operation results output by V_ALU,
processes the memory access and write-back requests of V_LS, and drives the four write channels
of the Vector register file according to the mask and tail write policy of the current instruction.
It retires the instruction according to the write response and the execution status of the current
instruction.
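The effect of the mask and tail write policy can be sketched as follows. The model follows the general RISC-V Vector convention that "undisturbed" elements keep their old value and "agnostic" elements may be overwritten with all ones; choosing to always write all ones for agnostic elements is an assumption of this sketch, and the function and parameter names are hypothetical.

```python
def write_back(old, new, vl, mask=None, tail_agnostic=False,
               mask_agnostic=False, sew_bits=32):
    """Merge a result vector into the destination register contents."""
    all_ones = (1 << sew_bits) - 1
    out = list(old)
    for i in range(len(old)):
        if i >= vl:
            # Tail element: beyond the active vector length.
            if tail_agnostic:
                out[i] = all_ones
        elif mask is not None and not mask[i]:
            # Masked-off element inside the active range.
            if mask_agnostic:
                out[i] = all_ones
        else:
            out[i] = new[i]
    return out


undisturbed = write_back([9, 9, 9, 9], [1, 2, 3, 4], vl=3, mask=[1, 0, 1, 1])
agnostic_tail = write_back([9, 9, 9, 9], [1, 2, 3, 4], vl=3, mask=[1, 0, 1, 1],
                           tail_agnostic=True)
```

Only active, unmasked elements ever receive the newly computed values; the policy flags decide what happens to the rest.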
V_VRF is the Vector register file module. It performs writes to the Vector registers through the
4 write channels from V_WB, and reads of the Vector register file through the 8 read channels
from V_ARITH_CTRL and V_LS.
For floating-point operations, the underlying algorithm is implemented by the FPU. V_ALU sends data
and control signals to the FPU. After the operation is completed, the FPU returns the result to V_ALU,
which processes it and then sends it to the WB module. Finally, the instruction graduates in the
main pipeline.
2.3. Vector instructions categories
Vector instructions are divided into three categories: configuration, operation, and memory access.
The operation and memory access instructions are implemented in the VPU, and the
configuration (vset) instructions are implemented in the main pipeline. The three classes are
executed serially.
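As an illustration of the three categories, the sketch below classifies a 32-bit instruction word using the major opcodes and `funct3`/width fields defined by the ratified RISC-V Vector extension. It is not the TS800 decoder; the function name `classify` and the returned labels are chosen here for illustration.

```python
OP_V = 0b1010111      # Vector arithmetic and vset* instructions
LOAD_FP = 0b0000111   # shared by scalar FP loads and Vector loads
STORE_FP = 0b0100111  # shared by scalar FP stores and Vector stores
VECTOR_WIDTHS = {0b000, 0b101, 0b110, 0b111}  # 8-, 16-, 32-, 64-bit elements


def classify(insn: int) -> str:
    """Return the Vector instruction category of a 32-bit instruction word."""
    opcode = insn & 0x7F
    funct3 = (insn >> 12) & 0x7
    if opcode == OP_V:
        # funct3 == 0b111 selects the configuration-setting (vset*) group.
        return "configuration" if funct3 == 0b111 else "operation"
    if opcode in (LOAD_FP, STORE_FP) and funct3 in VECTOR_WIDTHS:
        return "memory access"
    return "not vector"


# vadd.vv v1, v2, v3 (unmasked), built field by field.
vadd_vv = (1 << 25) | (2 << 20) | (3 << 15) | (0b000 << 12) | (1 << 7) | OP_V
kind = classify(vadd_vv)
```

A decoder that sees "configuration" would hand the instruction to the main pipeline, while the other two categories go to the VPU, matching the split described above.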
Vector instructions flow through the fixed-point pipeline and are retired at its write-back
stage. The values of the fixed-point and floating-point general-purpose registers required by a
Vector instruction are provided by the main pipeline when the instruction is issued, and are used
as inputs to the module.
2.4. VPU module area optimization scheme
To optimize the area and performance of the VPU hardware circuit, this paper applies
circuit multiplexing to the VPU implementation. Large-area units such as floating-point arithmetic,
multiplication, and division are multiplexed with the Scalar operators in the CPU, while only their
control circuit, which is small, is placed in the VPU-ALU. Arithmetic and logic, shift, comparison,
and permutation instructions are implemented in the VPU-ALU itself, which makes the overall
design area smaller and the performance better.
Operation instructions can execute in parallel in the Vector processor, whereas consecutive memory
access instructions are serialized.
By multiplexing the FPU and int ALU circuits, the area can be reduced by 19.3%, as shown in
Table 1.
Table 1. VPU area compared with optimized

| modules | percentage of total area | optimized percentage of total area |
|---|---|---|
| VPU | 17.70% | 17.70% |
| FPU | 13.90% | 0% |
| int ALU | 15.40% | 0% |
| total percent | 47.00% | 17.70% |
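The totals in Table 1 can be re-added with a few lines of Python. The sketch only sums the table's own entries; reading the "optimized" column as "the FPU and int ALU contribute no additional area because the VPU borrows the Scalar units" is an assumption, and the dictionary names are illustrative.

```python
# Percent of total chip area, copied from Table 1.
without_sharing = {"VPU": 17.7, "FPU": 13.9, "int ALU": 15.4}
with_sharing = {"VPU": 17.7, "FPU": 0.0, "int ALU": 0.0}


def total(column):
    """Sum one column, rounded to the table's one-decimal precision."""
    return round(sum(column.values()), 1)


total_without = total(without_sharing)   # 47.0
total_with = total(with_sharing)         # 17.7
difference = round(total_without - total_with, 1)
```

Computed this way, the two column totals differ by 29.3 percentage points of chip area, which equals the combined FPU and int ALU share. The 19.3% quoted in the text therefore presumably rests on a basis that the table does not show.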
3. Verification
Verification is based on a UVM verification platform. Results are checked with a reference
model and a scoreboard, and every test case has an explicit self-checking process. Following the
coverage-driven verification (CDV) methodology, directed test cases are added on top of random
test cases to reach 100% functional coverage and code coverage.
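The reference-model and scoreboard pattern can be sketched in Python, independent of UVM. Everything here is illustrative: `dut_vadd` is a stand-in for the design under test (a real bench would drive the RTL through the simulator), and the choice of a 32-bit Vector add as the operation under test is an assumption.

```python
import random


def reference_vadd(a, b, sew_bits=32):
    """Golden model: element-wise add with wrap-around."""
    mask = (1 << sew_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


def dut_vadd(a, b, sew_bits=32):
    """Stand-in for the design under test."""
    modulus = 1 << sew_bits
    return [(x + y) % modulus for x, y in zip(a, b)]


def run_scoreboard(cases):
    """Compare DUT output with the reference model; return all mismatches."""
    mismatches = []
    for a, b in cases:
        expected, actual = reference_vadd(a, b), dut_vadd(a, b)
        if expected != actual:
            mismatches.append((a, b, expected, actual))
    return mismatches


# Directed cases target corners; random cases fill in the rest.
directed = [([0xFFFFFFFF], [1]), ([0], [0])]
rng = random.Random(0)
randomized = [([rng.getrandbits(32) for _ in range(4)],
               [rng.getrandbits(32) for _ in range(4)]) for _ in range(100)]
failures = run_scoreboard(directed + randomized)
```

The directed cases play the role described in the text: they reach corner cases (here, overflow) that random stimulus might hit only rarely.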
Vector instructions can be categorized as follows: arithmetic operation instructions, logic operation
instructions, shift operation instructions, comparison operation instructions, permutation instructions
and transfer instructions between registers and DRAM.
The performance of the CPU with only the Scalar operation unit and that of the CPU with the
Vector unit are shown below. The fir execution process and simulation results, which exercise
the arithmetic instructions, serve as the example. The Scalar program contains only Scalar
instructions and uses no parallel execution units inside the controller. The Vector program
contains both Scalar instructions and Vector instructions. The simulation results in this section
were obtained by simulating the fir programs with VCS.
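To make the Scalar and Vector variants of a fir kernel concrete, the following Python sketch contrasts an element-at-a-time loop with a strip-mined form that processes up to `vlmax` outputs per pass. It is not the test code of Figures 3 and 5, which is not reproduced here; `vlmax` is a hypothetical hardware vector length, and the kernel applies the taps in forward order (the correlation form, equivalent to a conventional FIR filter with reversed coefficients).

```python
def fir_scalar(x, h):
    """One multiply-accumulate at a time."""
    n_out = len(x) - len(h) + 1
    y = []
    for n in range(n_out):
        acc = 0
        for k, coeff in enumerate(h):
            acc += coeff * x[n + k]
        y.append(acc)
    return y


def fir_vector(x, h, vlmax=4):
    """Up to vlmax outputs per pass, mimicking Vector multiply-accumulate."""
    n_out = len(x) - len(h) + 1
    y = [0] * n_out
    start = 0
    while start < n_out:
        vl = min(vlmax, n_out - start)
        acc = [0] * vl
        for k, coeff in enumerate(h):
            window = x[start + k:start + k + vl]                # unit-stride load
            acc = [a + coeff * w for a, w in zip(acc, window)]  # vector MAC
        y[start:start + vl] = acc
        start += vl
    return y


samples = list(range(10))
taps = [1, 2, 3]
scalar_out = fir_scalar(samples, taps)
vector_out = fir_vector(samples, taps)
```

The inner loop over taps runs once per group of outputs in the Vector form instead of once per output, which is where a cycle-count reduction of this kind comes from.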
3.1. Scalar instruction simulation
The Scalar test code is shown in Figure 3.
Figure 3. Scalar test code
Simulation results are shown in Figure 4.
Figure 4. Simulation results of Scalar instructions
3.2. Vector and Scalar instruction simulation
The test code with Vector and Scalar instructions is shown in Figure 5.
Figure 5. Vector and Scalar test code
Simulation results of the Vector and Scalar instructions are shown in Figure 6.
Figure 6. Simulation results of Vector and Scalar instructions
4. Performance
4.1. TS800 timing optimization results
Tests with the fir, fft, conv, matrix, Signal Converge, and variance programs show that, for
the same program, the running time of the CPU with only Scalar units is 1.44 to 9.55 times that
of the CPU with the Vector module, as shown in Table 2.
Table 2. Parallel processing clock cycles

| arithmetic | no Vector (clk cycles) | Vector (clk cycles) | no Vector / Vector |
|---|---|---|---|
| fir | 773,975 | 157,879 | 4.90 |
| fft | 1,655,920 | 567,172 | 2.92 |
| conv | 7,457,555 | 4,005,817 | 1.86 |
| matrix | 486,324 | 337,254 | 1.44 |
| Signal Converge | 9,281,420 | 971,422 | 9.55 |
| variance | 23,779 | 13,587 | 1.75 |
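The ratio column of Table 2 follows directly from the two cycle counts. The short Python check below recomputes it from the tabulated values; the dictionary layout and function name are illustrative.

```python
# (cycles without Vector, cycles with Vector), copied from Table 2.
cycles = {
    "fir": (773_975, 157_879),
    "fft": (1_655_920, 567_172),
    "conv": (7_457_555, 4_005_817),
    "matrix": (486_324, 337_254),
    "Signal Converge": (9_281_420, 971_422),
    "variance": (23_779, 13_587),
}


def speedup(name):
    """Scalar-only cycles divided by Vector cycles, to two decimals."""
    scalar_cycles, vector_cycles = cycles[name]
    return round(scalar_cycles / vector_cycles, 2)


ratios = {name: speedup(name) for name in cycles}
lowest, highest = min(ratios.values()), max(ratios.values())
```

The smallest and largest ratios, for matrix and Signal Converge respectively, give the 1.44 to 9.55 range quoted in the text.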
4.2. TS800 area optimization results
The layout of the chip is shown in Figure 7. The red block is the VPU, which occupies 17.7% of
the whole chip.
Figure 7. Layout of the TS800 chip
The area of the TS800 is shown in Table 3.
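Table 3. TS800 area results

| modules | submodules | areas | percentage of total area | percentage of submodule |
|---|---|---|---|---|
| m800 | / | 719544.4 | 100 | / |
| FPU | ALL | 100007.5 | 13.9 | 100 |
| FPU | fpu_alu0 | 10936.2 | 1.5 | 10.8 |
| FPU | fpu_alu1 | 10943 | 1.5 | 10.8 |
| FPU | fpu_div0 | 9882.5 | 1.4 | 10.1 |
| FPU | fpu_div1 | 9731.7 | 1.4 | 10.1 |
| FPU | fpu_mul0 | 14654.1 | 2 | 14.4 |
| FPU | fpu_mul1 | 14451.2 | 2 | 14.4 |
| FPU | fpu_regbank | 21057.1 | 2.9 | 20.9 |
| VPU | ALL | 127560.1 | 17.7 | 100 |
| VPU | vpu_valu0 | 7046.1 | 1 | 5.6 |
| VPU | vpu_valu1 | 13586 | 1.9 | 10.7 |
| VPU | vpu_valu2 | 17005.5 | 2.4 | 13.6 |
| VPU | vpu_varthctl0 | 5346.9 | 0.7 | 4 |
| VPU | vpu_varthctl1 | 6079.3 | 0.8 | 4.5 |
| VPU | vpu_varthctl2 | 12427 | 1.7 | 9.6 |
| VPU | vpu_vdec | 1023 | 0.1 | 0.5 |
| VPU | vpu_vib | 2747.3 | 0.4 | 2.3 |
| VPU | vpu_vlau | 736.7 | 0.1 | 0.5 |
| VPU | vpu_vlsu | 14188.3 | 2 | 11.2 |
| VPU | vpu_vwb | 46571.3 | 6.5 | 36.7 |
| int_ALU | ALL | 111124.6 | 15.4 | 100 |
| int_ALU | mul0 | 14736 | 2 | 13.26 |
| int_ALU | mul1 | 14557 | 2 | 13.1 |
| int_ALU | reg | 13005 | 1.8 | 11.7 |
| int_ALU | div0 | 24349 | 3.38 | 21.91 |
| int_ALU | div1 | 24229 | 3.37 | 21.8 |
| int_ALU | others | 20335.8 | 2.83 | 18.3 |
| bht_bim_ram0 | / | 886.4 | 0.1 | / |
| bht_bim_ram1 | / | 886.4 | 0.1 | / |
| bht_meta_ram0 | / | 886.4 | 0.1 | / |
| bht_meta_ram1 | / | 886.4 | 0.1 | / |
| bht_ram0 | / | 1616.1 | 0.2 | / |
| bht_ram1 | / | 1616.1 | 0.2 | / |
| genblk1_u_data_ram0 | / | 51545 | 7.2 | / |
| genblk1_u_data_ram1 | / | 51545 | 7.2 | / |
| genblk1_u_tag_ram0 | / | 3398.4 | 0.5 | / |
| genblk1_u_tag_ram1 | / | 3397 | 0.5 | / |
| bpu | / | 8188.7 | 1.1 | / |
| cpu_reset | / | 24.8 | 0 | / |
| fu | / | 16350.4 | 2.3 | / |
| intrpp | / | 101.2 | 0 | / |
| lsu | / | 203042.9 | 28.2 | / |
| sch | / | 10596.8 | 1.5 | / |
| ts800_trig_top | / | 7280.6 | 1 | / |
| wb_top | / | 16828.9 | 2.3 | / |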
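The percentage column of Table 3 can be reproduced from the raw areas. The check below covers the top-level modules discussed in the text and reads the VPU area as 127560.1, which is an interpretation of the printed "127560,1"; the area units are not given in the source and are left unspecified here.

```python
TOTAL_AREA = 719544.4  # m800, whole design

# Raw areas copied from Table 3.
areas = {
    "FPU": 100007.5,
    "VPU": 127560.1,
    "int_ALU": 111124.6,
    "lsu": 203042.9,
}


def share(name):
    """Area of one module as a percentage of the whole design."""
    return round(100 * areas[name] / TOTAL_AREA, 1)


shares = {name: share(name) for name in areas}
```

The recomputed shares agree with the tabulated 13.9, 17.7, 15.4, and 28.2, and the VPU value matches the 17.7% quoted for the layout.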
To optimize the area and performance of the VPU hardware circuit, this paper applies
circuit multiplexing to the VPU implementation. The comparison with the optimized area is shown in
Table 1. By multiplexing the FPU and int ALU circuits, the area can be reduced by 19.3%.
5. Conclusions
Applications such as image and signal processing involve a large number of data-parallel
operations. To solve the problem of low performance of parallel data calculations in
industrial power applications, we add a hardware VPU to the TS800.
Taking the key technology of data-parallel architecture as the research object and parallel computing
as the technical support, we set out the VPU hardware circuit requirements of the TS800, implemented
the design in SystemVerilog, verified it in a UVM environment, and compared application data on a
hardware accelerator. The simulation and verification results show that the design of the VPU based
on the RISC-V architecture is logically feasible. By multiplexing the FPU and int ALU circuits, the
area can be reduced by 19.3% (Table 1). Tests with the fir, fft, conv, matrix, Signal Converge, and
variance programs show that, for the same program, the running time of the CPU with only Scalar
units is 1.44 to 9.55 times that of the CPU with the Vector module. Further research will address
processor pipelining, which should greatly improve the data communication performance of the
VPU's parallel computing.
Acknowledgments
This research was supported by the science and technology project of State Grid Corporation of China,
Research on Key Technology of High Performance of Embedded CPU Core (No. 5700202141449A0
000).
References
[1] Yumu Wang, Zhiming Pan, Pengfei Wu, Wei Fu, Lelan Tian, Guirun Li, Yiqun Sun, “Porting Optimization of Yolov3 Based on RISC-V Vector Instruction Set”, Microcontrollers & Embedded Systems, December 2021, 30–35, 40.
[2] Sheng Liu, Bo Yuan, Yang Guo, Haiyan Sun, Zekun Jiang, “Vector Memory-Access Shuffle Fused Instructions for FFT-like Algorithms”, Chinese Journal of Electronics, January 2023, 1–12.
[3] Yuluo Guo, Haodong Bian, Runting Dong, Jiahao Tang, Xiaoying Wang, Jianqiang Huang, “Parallel Fourier Space Image Similarity Calculation Based on SIMD”, Computer Engineering, November 2021, 253–259.
[4] Guang Wang and Xiangjun Li, “A Design of Reconfigurable Bus Based on the Embedded Microprocessor SIMD Core”, IEEE 4th International Conference on Software Engineering and Service Science, May 2013, 740–741.
[5] Haiyan Chen, Chao Yang, Sheng Liu, Zhong Liu, “An Efficient SIMD Parallel Storage Structure Oriented to Radix-2 FFT Algorithm”, Journal of Electronics, 2016, 38.
[6] Yongjiu Feng, Xinjun Chen, Feng Gao, Yang Liu, “Impacts of changing scale on Getis-Ord Gi* hotspots of CPUE: a case study of the neon flying squid (Ommastrephes bartramii) in the northwest Pacific Ocean”, Acta Oceanologica Sinica, May 2018, 71–80.
[7] Kaixuan Zhang, Li Ding, Yujie Cai, Wenbo Yin, Fan Yang, Jun Tao and Lingli Wang, “A High Performance Real-Time Edge Detection System with NEON”, IEEE 12th International Conference on ASIC, October 2017.
[8] Youyao Liu, Zhongwei Zhang, “Design of instruction level parallel structure based on SIMD architecture”, Electronic Design Engineering, November 2017, 152–156.
[9] Fengjiao Li, Naijie Gu, Dongsheng Qi, Junjie Su, “Vectorization Study for FFT Algorithm Based on ARM SVE”, Journal of Chinese Computer Systems, October 2022, 37.
[10] Cunyang Wei, Haipeng Jia, Yunquan Zhang, Guoyuan Qu, Dazhou Wei, Guangting Zhang, “Implementation and optimization of image processing algorithms based on ARMv8 CPUs”, Computer Engineering & Sciences, October 2022, 1711–1720.
[11] Wei Gao, Yingying Li, Huihui Sun, Yanbing Li, Rongcai Zhao, “An Improved Control Flow SIMD Vectorization Method”, Journal of Software, 2017, 28(8):18.
[12] Bingchao Li, Jizeng Wei, Wei Guo, Jizhou Sun, “Improving SIMD Utilization with Thread Lane Shuffled Compaction in GPGPU”, Chinese Journal of Electronics, April 2015, 22–26.
[13] M. Rhu and M. Erez, “The dual-path execution model for efficient GPU control flow”, Proc. of International Symposium on High Performance Computer Architecture, Shenzhen, China, 2013, 591–602.
[14] Aniruddha S. Vaidya, Anahita Shayesteh, Dong Hyuk Woo, et al., “SIMD divergence optimization through intra-warp compaction”, Proc. of International Symposium on Computer Architecture, Tel-Aviv, Israel, 2013, 368–379.