IOPS and QoS Analysis of
DRAM/Flash-based and All-MRAM based NVRAM cards
Lorenzo Zuolo†**, Cristian Zambelli†, Terry Hulett*, Ben Cooke*, Rino Micheloni‡**, and Piero Olivo†
†Dipartimento di Ingegneria – Università degli Studi di Ferrara – Via Saragat, 1 – 44122 Ferrara (Italy)
SSDVision – Via Dosso Dossi, 11 – 44122 Ferrara (Italy)
*Everspin Technologies, Chandler, Arizona 85224 (USA)
‡Microsemi Corporation – Via Torri Bianche, 1 – 20871 Vimercate (Italy)
**Work done at Università degli Studi di Ferrara
Abstract—PCIe DRAM/Flash-based NVRAM (Non-Volatile RAM)
cards are gaining traction in the market because they can be used either
as a very fast and secure synchronous write buffer, or to store both criti-
cal system data and user data in case of Power Failure. In a nutshell, the
host sees the NVRAM card as a bunch of DRAM devices connected over
a PCIe bus. If the power suddenly disappears, the on-board controller
copies the DRAM content to a bank of Flash memories; during this copy
operation, a super capacitor supplies the necessary energy.
MRAM memories are now mature enough to offer a technically viable
alternative to the combination of DRAM and Flash; thanks to MRAM's
inherent non-volatility, the super capacitor is no longer needed. In
this work, we present a thorough analysis of IOPS and la-
tency (QoS) for both DRAM/Flash-based and All-MRAM NVRAM
cards. Results of simulations indicate that MRAM NVRAM cards can
compete with legacy DRAM/Flash cards, at least when looking at per-
formance figures such as random read/write IOPS and latency.
Introduction. Non-Volatile RAM (NVRAM) cards address the
need for persistency of small, frequent, and/or transient data. For ex-
ample, they can be used as a very fast and secure synchronous write
buffer, which can quickly acknowledge synchronous writes, without
compromising data integrity/persistency. Another typical use case for
NVRAM cards is the storage of critical system data and user data in
case of Power Failure: among others, File System Metadata,
Transaction Logs, and Cache Index Tables.
The most common architecture of existing NVRAM cards is
shown in Fig. 1a [1]. The card is connected to the Host via a PCIe
interface and it contains DRAM and NAND Flash memories [2], to-
gether with a controller and a super capacitor. DRAM is used as
primary storage, in the sense that it is directly exposed to the Host
via PCIe. NAND Flash memories, on the contrary, are hidden from the
Host and are activated only in case of an abrupt power down.
When this happens, the on-board controller copies the entire DRAM
content into the Flash array, thus making data non-volatile. Given the
fact that the external power supply has been shut down, the energy
required for reading from the DRAM and writing to the NAND is
supplied by a super capacitor. As a matter of fact, NAND Flash
devices, most of the controller, and the super capacitor are there
only because DRAMs are volatile, and all these additional components
have an impact on cost, power, and reliability. Therefore, there is
continuous research into non-volatile technologies that can simplify
the overall design of NVRAM cards.
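To put the role of the super capacitor in perspective, a back-of-the-envelope energy budget for the power-fail dump can be run with a few lines of Python, as sketched below. All figures (NAND program bandwidth, backup power, capacitance, voltage window) are illustrative assumptions and do not describe the commercial card of [1].

    # Back-of-the-envelope energy budget for a power-fail DRAM-to-Flash dump.
    # All numbers are illustrative assumptions, not measured card parameters.
    DRAM_SIZE_BYTES = 1 * 1024**3   # 1 GByte of DRAM content to be saved
    FLASH_WRITE_BW  = 400e6         # assumed sustained NAND program bandwidth [B/s]
    BACKUP_POWER_W  = 6.0           # assumed DRAM + NAND + controller power during the dump [W]
    CAP_FARAD       = 25.0          # assumed super-capacitor bank capacitance [F]
    V_START, V_MIN  = 5.0, 2.7      # assumed start and minimum usable voltages [V]

    dump_time_s        = DRAM_SIZE_BYTES / FLASH_WRITE_BW
    energy_needed_j    = BACKUP_POWER_W * dump_time_s
    energy_available_j = 0.5 * CAP_FARAD * (V_START**2 - V_MIN**2)

    print(f"dump time:        {dump_time_s:.1f} s")
    print(f"energy needed:    {energy_needed_j:.0f} J")
    print(f"energy available: {energy_available_j:.0f} J")
    print("capacitor sized correctly" if energy_available_j > energy_needed_j
          else "capacitor undersized")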
All-MRAM NVRAM cards. Given the most recent advancements
[3], MRAM technology seems to be a technically viable alternative
for building simplified NVRAM cards, which would look like Fig.
1b. Because high performance is the main value proposition of
NVRAM cards, it is key to understand how the number of random
read/write IOPS and the latency figures (i.e., QoS, Quality of Service) would
be impacted by replacing DRAM with MRAM. In this work we pre-
sent a detailed IOPS/QoS analysis based on a commercial
DRAM/Flash NVRAM card [1] combined with Everspin’s 256 Mbit
perpendicular Spin-Transfer-Torque (STT) MRAM [4].
To get started, we successfully proved the interoperability between
the on-board controller and the selected MRAM device, over the
DDR3 bus. Figure 2 shows the validation board of the on-board con-
troller that was used for this test: MRAM devices fully populate a
UDIMM which is vertically mounted on the validation board. A cus-
tom GUI, shown in Fig. 3, was also developed to check interopera-
bility under different workload conditions (i.e. different combina-
tions of read/write operations).
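The command path of the controller and of the GUI is proprietary; as a minimal sketch of how such workload conditions can be produced, the Python snippet below generates a mixed read/write command stream over a 256 Mbit (32 MByte) address space. The function name and parameters are hypothetical and serve only the illustration.

    import random

    def mixed_workload(n_cmds, read_ratio, address_space, seed=0):
        """Yield (operation, byte address) pairs for a read/write mix.
        Minimal sketch; the real controller/GUI command path is not public."""
        rng = random.Random(seed)
        for _ in range(n_cmds):
            op = "read" if rng.random() < read_ratio else "write"
            yield op, rng.randrange(address_space)

    # Example: 70% reads / 30% writes over a 32 MByte (256 Mbit) MRAM device.
    for op, addr in mixed_workload(n_cmds=5, read_ratio=0.7, address_space=32 * 1024**2):
        print(op, hex(addr))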
Data correlation and Simulation Framework. Besides the experimental
setup of Fig. 2, we wanted to develop a simulation platform to enable
the design of next-generation NVRAM cards based on the All-MRAM
approach. The selected simulation framework is SSDExplorer [5], a
Fine-Grained Design Space Exploration (FGDSE) tool well suited for
evaluating the impact of micro-architectural design choices on
performance. Figure 4 sketches the
architecture of the adopted simulation framework, while Table 1
summarizes the main characteristics of the simulated host system and
NVRAM cards. Figure 5 shows the comparison between measured
performance [6] and SSDExplorer simulation results in terms of IOPS:
there is a close match for both Random Read and Random Write
workloads. It is worth highlighting that the match is consistent
across different queue depths.
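The quality of the match can be quantified as a per-queue-depth relative error between measured and simulated IOPS. The short sketch below illustrates the calculation; the IOPS values are placeholders, not the data plotted in Fig. 5.

    # Per-queue-depth relative error between measured and simulated IOPS.
    # The IOPS values below are placeholders, not the data of Fig. 5.
    measured  = {1: 90_000, 8: 450_000, 32: 700_000, 256: 750_000}
    simulated = {1: 92_000, 8: 460_000, 32: 690_000, 256: 755_000}

    for qd in measured:
        err = abs(simulated[qd] - measured[qd]) / measured[qd]
        print(f"QD={qd:>3}: measured={measured[qd]:>8,} "
              f"simulated={simulated[qd]:>8,} error={err:.1%}")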
DRAM/Flash-based NVRAM vs. All-MRAM NVRAM. Figure 6
shows a direct IOPS comparison between the two architectures
described in Fig. 1. In this case too, we considered random write and
random read workloads at different queue depths. The All-MRAM
architecture can keep up with the requested number of transactions
without any significant performance degradation. Although
counter-intuitive at first, this result can be explained by noting
that the simulated NVRAM architecture sits behind a PCIe interface,
which turns out to be the bottleneck of the system.
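A first-order calculation supports this observation: even ignoring TLP and NVMe protocol overheads, a PCIe Gen3 x8 link caps 4 kByte random traffic at roughly 1.9 million IOPS, regardless of how fast the memory behind it is. The sketch below reproduces the arithmetic; the real ceiling is lower once protocol overhead is accounted for.

    # Rough upper bound on 4 kByte random IOPS imposed by a PCIe Gen3 x8 link.
    # TLP/NVMe protocol overheads are ignored, so the real ceiling is lower.
    LANES    = 8
    GT_PER_S = 8e9          # PCIe Gen3 raw rate per lane [transfers/s]
    ENCODING = 128 / 130    # 128b/130b line coding efficiency
    IO_SIZE  = 4 * 1024     # transfer size [bytes]

    link_bw_bytes = LANES * GT_PER_S * ENCODING / 8
    iops_ceiling  = link_bw_bytes / IO_SIZE

    print(f"usable link bandwidth: {link_bw_bytes / 1e9:.2f} GB/s")
    print(f"4 kByte IOPS ceiling:  {iops_ceiling / 1e6:.2f} M IOPS")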
Last but not least, let's take a look at latencies for both random
read and write workloads. These latencies are becoming more and more
important, especially in applications where fast response is
critical, such as financial trading and e-commerce. Delays introduced
by the host system could mask the actual card performance; therefore,
latency has been evaluated as the raw latency introduced by the card
only. Fig. 7 and Fig. 8 show the Cumulative Distribution Function
(CDF) of the latencies for queue depths of 8 and 256, under 100%
4 kByte random write and 100% 4 kByte random read workloads,
respectively. Similarly to what we have seen for IOPS, All-MRAM
NVRAM cards can respond as quickly as legacy cards, and this remains
true when looking at the upper part of the CDF. It is worth
highlighting that the small differences in the latency profile are
negligible once the overhead introduced by the application software
is added.
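QoS figures of this kind are typically reported as tail-latency percentiles extracted from the per-command latency samples behind the CDF. The sketch below shows such an extraction on a synthetic, lognormally distributed trace; it is for illustration only and is not output of the simulator.

    import random

    def percentile(samples, p):
        """Nearest-rank percentile of a list of latency samples."""
        s = sorted(samples)
        rank = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
        return s[rank]

    # Synthetic latency trace in microseconds (illustration only).
    rng = random.Random(42)
    trace_us = [rng.lognormvariate(2.3, 0.35) for _ in range(100_000)]

    for p in (50, 90, 99, 99.9, 99.99):
        print(f"p{p}: {percentile(trace_us, p):7.1f} us")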
Conclusions. In this work we have shown that a 256 Mbit STT-MRAM
is a viable alternative for replacing DRAM inside NVRAM cards. IOPS
and latency figures have been extensively analyzed under different
workload conditions and queue depths. In all cases, we have not
detected any significant performance degradation with respect to the
legacy DRAM/Flash solution. MRAM-based architectures can definitely
simplify the card design by removing the need for Flash memories and
the super capacitor. STT-MRAM devices can already be used today and
will become even more interesting with the coming 1 Gbit and future
4 Gbit densities. Looking forward, the overall cost and power of the
NVRAM card need to be assessed.
References
[1] FlashTec NVRAM Drives, Microsemi. 2016. [Online]. Available:
http://www.microsemi.com/products/storage/flashtec-nvram-drives/flashtec-nvram-drives
[2] R. Micheloni et al., Inside NAND Flash Memories, Springer, 2010.
[3] D. Apalkov et al., "Magnetoresistive Random Access Memory," Proceedings of the IEEE, vol. 104,
no. 10, pp. 1796-1830, October 2016.
[4] J. M. Slaughter et al., "Technology for Reliable Spin-Torque MRAM Products," Proc. 2016
IEDM, paper 21.5, IEEE, 2016.
[5] L. Zuolo et al., "SSDExplorer: A Virtual Platform for Performance/Reliability-Oriented Fine-Grained
Design Space Exploration of Solid State Drives," IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 34, no. 10, pp. 1627-1638, 2015.
[6] PMC-Sierra Flashtec NVRAM Drive Review. [Online]. Available:
http://www.tomsitpro.com/articles/pmc-sierra-flashtec-nvram-drive,2-954.html
Figure 1. Block diagram of a DRAM/Flash NVRAM card (a) and of an All-MRAM card
(b).
Figure 2. SSD controller evaluation board used for interoperability tests. Eight standard
SO-DIMM sockets can accommodate NAND flash memory cards, while one standard
DDR UDIMM socket can host either DRAMs or MRAM memories.
Figure 3. Custom-developed Graphical User Interface (GUI) for MRAM testing. This
GUI helps program the SSD controller to issue single/multiple read/write opera-
tions to/from the MRAM DIMM.
NVRAM card parameter      Configuration
Host Interface            PCI-Express Gen3 x8
Host protocol             NVMe 1.1
DRAM/MRAM size            1 GByte
DRAM/MRAM controller      Single channel

Host System               Configuration
                          Intel S2600GZ server
                          Dual Xeon E5-2680 v2
                          128 GBytes DRAM

Table 1. Main characteristics of the host system and the simulated NVRAM cards.
Figure 4. Architecture of the simulation framework used to test the performance and the
latency of both the DRAM/Flash and the All-MRAM NVRAM cards. The parameters of
the SSD simulator can be tuned to simulate a wide variety of SSD architectures and
memories. Qemu is used as a workload generator.
Figure 5. Performance comparison between a real (red dashed columns) and a simulated
(blue solid columns) DRAM/Flash-based card when 100% 4 kByte random write (a)
and 100% 4 kByte random read (b) workloads are considered, respectively. Different
host interface queue depths are shown.
Figure 6. Performance comparison between a real DRAM/Flash-based card (red dashed
columns) and a simulated All-MRAM card (blue solid columns) when 100% 4 kByte
random write (a) and 100% 4 kByte random read (b) workloads are considered,
respectively. Different host queue depths are shown.
Figure 7. Latency cumulative distribution functions of the NVRAM cards when the
standard DRAM/Flash architecture and the All-MRAM configuration are considered,
respectively. Queue depths (QD) of 8 and 256 commands have been used. The simulated
workload is 100% 4 kByte random write.
Figure 8. Latency cumulative distribution functions of the NVRAM cards when the
standard DRAM/Flash architecture and the All-MRAM configuration are considered,
respectively. Queue depths (QD) of 8 and 256 commands have been used. The simulated
workload is 100% 4 kByte random read.