DSS: High I/O Bandwidth Disaggregated Object
Storage System for AI Applications
Mahsa Bayati
Memory Solution Lab
Samsung Semiconductor Inc.
San Jose, CA
mahsa.b@samsung.com
Harsh Roogi
Memory Solution Lab
Samsung Semiconductor Inc.
San Jose, CA
h.roogi@samsung.com
Somnath Roy
Memory Solution Lab
Samsung Semiconductor Inc.
San Jose, CA
som.roy@samsung.com
Ron Lee
Memory Solution Lab
Samsung Semiconductor Inc.
San Jose, CA
r2.lee@samsung.com
Abstract—With the exponential growth of data, especially in data-intensive applications such as deep learning and AI, the demand for high-bandwidth, highly scalable storage that supports unstructured data is increasing. To fulfill this need, Samsung proposes the DSS storage solution, which implements an object key-value API on top of NVMe over Fabrics (NVMe-oF) SSD hardware. The new solution retains the attributes of NVMe-oF and is designed explicitly for storing data in object format. Support for remote storage-access protocols (i.e., RDMA) facilitates the disaggregation of storage and computational resources, so storage can be scaled easily. Beyond object storage and scalability, the proposed architecture can provision the bandwidth demanded by each application running on each client server. This paper introduces the DSS storage system, which delivers high bandwidth per unit of capacity for object-format data and scales effortlessly. DSS deterministically provisions bandwidth to client sessions to mitigate contention and starvation, making the design well suited to large, concurrent, multi-session, read-intensive workloads such as AI training. Our design achieves throughput of 180-275 GB/sec for reads and 26-38 GB/sec for writes when evaluated with the S3 benchmark.
Index Terms—Disaggregated storage, object storage, high I/O bandwidth, AI data-intensive applications
I. INTRODUCTION & MOTIVATION
The fast-growing volume of data must be stored on and retrieved from advanced storage devices that can sustain high I/O bandwidth, meet users' desired quality of service, and scale resources easily. Some leading storage companies claim read I/O bandwidth of around 24 GB/sec per node with a maximum capacity of 256 TB across four nodes. The Samsung DSS solution delivers similar bandwidth and scales up conveniently, supporting more than 256 TB with only two nodes.
Moreover, an increasingly large portion of generated data is in object format. Current storage systems use block SSD devices, so there is overhead in converting objects to block format. Samsung recently developed a new key-value API for NVMe-oF devices that stores object-format data directly on the disk, without operating-system involvement in block/object conversion. This key-value design is built on NVMe-oF SSDs, enabling access to storage devices over the fabric in a disaggregated fashion. A disaggregated system scales more easily, by adding resources to either the storage or the computational nodes. Applications such as AI and deep learning employ unstructured data and place aggressive demands on storage; our disaggregated storage is thus an excellent match.
One of the main concerns for a storage server that serves multiple client application sessions simultaneously (e.g., AI training sessions accessing the storage concurrently) is bandwidth inconsistency and congestion. This usually happens when one client session aggressively seizes the bandwidth or when many client sessions compete with one another, making quality of service and run time unpredictable. DSS can deterministically provide each client session with its required quality of service, along with a prediction of its completion time.

Fig. 1. (a) DSS architecture overview (client/object storage servers). (b) DSS back-end and front-end components and the network design.
Accordingly, our storage server provides a congestion-free access design for object-format storage with effortless scalability. The rest of this paper is organized as follows: Sec. II discusses the DSS architectural design, Sec. III presents the results of our system evaluation, and Sec. IV concludes and outlines future work.
II. ARCHITECTURAL DESIGN
DSS has three essential components: (I) object storage servers, (II) client servers, and (III) the network connecting these clients and storage servers together (see Fig. 1 (a)).
A. Architectural Components
1) Storage Server: Our storage servers can be divided into two sections: (I) the front-end and (II) the back-end, i.e., the target. In this paper, we co-locate both components on one server, which reduces cost and simplifies the network and communication setup. As shown in Fig. 1 (b), the front-end mainly comprises modified MinIO [1] software, a well-known Amazon S3-compatible open-source object store that here uses the KV API for data-store access. We have modified stock MinIO to run in a distributed, shared-everything key-value environment for improved scaling and performance. MinIO is responsible for data consistency through erasure coding, which tolerates faulty drives and random bit flips.
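To illustrate the idea behind erasure coding, the toy sketch below shows single-parity (XOR) recovery of a lost shard. This is a deliberately simplified stand-in: MinIO actually uses Reed-Solomon coding across drives, which tolerates multiple failures, and none of the names below come from the DSS or MinIO code bases.

    # Toy XOR-parity sketch (RAID-5 style), illustrative only; MinIO's
    # real erasure coding is Reed-Solomon across data and parity drives.
    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    shards = [b"\x01" * 4, b"\x02" * 4, b"\x04" * 4]  # data shards
    parity = b"\x00" * 4
    for s in shards:                  # parity = XOR of all data shards
        parity = xor_bytes(parity, s)

    lost = 1                          # simulate losing shard 1
    rebuilt = parity
    for i, s in enumerate(shards):
        if i != lost:                 # XOR of survivors plus parity
            rebuilt = xor_bytes(rebuilt, s)
    assert rebuilt == shards[lost]    # lost shard reconstructed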
The NVMe-oF target stack is located on the back-end and provides high-performance key-value services over RDMA and IP-based networks. The target application software is designed to run in user mode; it abstracts the SSD devices and performs the key-value operations. To offer a storage-pool abstraction in distributed storage, we employ two components: (I) the KV pool, which can be mapped to one or more SSD devices, and (II) the subsystem, which provides the object-storage abstraction required to pool many SSDs behind one or more namespace(s)/container(s) with aggregated performance. Each subsystem can be exported to a client application as a single namespace or container.
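As a rough illustration of this layering, the sketch below describes how drives might be grouped into KV pools and exported as subsystems. The field names, NQNs, and addresses are hypothetical assumptions made for illustration; the paper does not specify the actual DSS target configuration format.

    # Hypothetical illustration of the KV pool / subsystem layering; all
    # names, NQNs, and addresses are invented for this sketch.
    target_config = {
        "kv_pools": {
            # Each KV pool maps to one or more physical KV-SSDs.
            "pool0": {"devices": ["nvme0n1", "nvme1n1", "nvme2n1", "nvme3n1"]},
            "pool1": {"devices": ["nvme4n1", "nvme5n1", "nvme6n1", "nvme7n1"]},
        },
        "subsystems": {
            # Each subsystem pools the SSDs of a KV pool behind a namespace
            # and is exported to a client over an RDMA (RoCEv2) listener.
            "nqn.2021-01.com.example:subsys0": {
                "kv_pool": "pool0", "transport": "rdma",
                "listen": "192.168.30.10:4420",
            },
            "nqn.2021-01.com.example:subsys1": {
                "kv_pool": "pool1", "transport": "rdma",
                "listen": "192.168.31.10:4420",
            },
        },
    }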
Another essential component of the back-end is the UFM (Unified Fabric Manager). The UFM is a lightweight ecosystem software that manages Samsung devices, as shown in Fig. 2. It can manage any topology, architecture, and storage (in this paper, KV-SSDs). In our architecture, the UFM mainly manages the fabric; it discovers, monitors, and configures devices and networks, and it collects logs and statistics to ensure the cluster is working properly.
Fig. 2. Unified Fabric Manager architecture.
2) Network Setup: DSS supports multiple high-speed Ethernet ports; each storage server has four dual-port NICs. The network software stack therefore supports two different protocols. (I) TCP/IP: the front-end VLANs are set up for interaction between the clients and the storage servers. Clients operate on object data by sending GET/PUT/LIST/DEL requests to MinIO over these VLANs, which carry the S3 traffic (see Fig. 1 (b), VLANs 40:43). (II) RoCEv2: the back-end VLANs are reserved for the targets' RDMA access to the subsystems; clients do not interact with this network and instead send their requests indirectly through MinIO. VLANs 30:34 direct the object traffic coming from the targets. The targets listen on the RDMA ports and manage access to the objects on each drive.
3) Client Servers: Client servers are responsible for running applications and requesting data from storage. To facilitate access to our object storage, we developed the DSS Client Library, which provides APIs that act as a medium between the client and DSS storage. The client library is responsible for loading requested data from storage and for distributing data among the DSS storage servers. It accesses the storage servers and performs the actual S3 operations, such as PUT/GET/DEL/LIST. It takes a cluster configuration containing a list of endpoints as input and maximizes performance by load balancing and distributing user requests across those endpoints.
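The paper does not expose the DSS Client Library's API, so the sketch below only approximates the endpoint load-balancing idea using the generic boto3 S3 client: requests are spread round-robin over a list of MinIO endpoints. The endpoints and credentials are placeholders.

    import itertools
    import boto3

    # Placeholder endpoints/credentials; in DSS these would come from the
    # cluster configuration handed to the client library.
    ENDPOINTS = ["http://dss-node1:9000", "http://dss-node2:9000"]
    clients = [
        boto3.client("s3", endpoint_url=ep,
                     aws_access_key_id="minio",
                     aws_secret_access_key="minio123")
        for ep in ENDPOINTS
    ]
    rr = itertools.cycle(clients)  # round-robin over endpoints

    def put_object(bucket: str, key: str, data: bytes) -> None:
        next(rr).put_object(Bucket=bucket, Key=key, Body=data)

    def get_object(bucket: str, key: str) -> bytes:
        return next(rr).get_object(Bucket=bucket, Key=key)["Body"].read()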
III. EXPERIMENTS & RESULTS
Experimental Setup: We evaluate the DSS architecture using 10 homogeneous storage servers and 16 client servers. Table I lists the hardware specifications of the storage and client servers.
TABLE I
HARDWARE SPECIFICATION

                      Storage Server          Client
CPU Type              AMD EPYC 7742           Dell R740xd
CPU Speed             3.4 GHz                 2.6 GHz
Num of Cores          64                      24
OS                    CentOS Linux            CentOS
NIC                   4x dual-port 200 GbE    2x 100 GbE
Storage Node SSD      PM1733 (16x) 4 TB       N/A
Results: To measure the performance of the DSS system, we ran the S3 benchmark [2] over 30 TB of data with a 1 MB object size. We measured throughput with and without erasure coding (EC). As shown in Table II, our DSS storage cluster achieves around 180 to 275 GB/sec for GET and 26 to 38 GB/sec for PUT operations.
TABLE II
DSS ARCHITECTURE PERFORMANCE USING S3-BENCHMARK

        With EC (GB/sec)    Without EC (GB/sec)
PUT     26.27               38.2
GET     180                 275.4
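For reference, aggregate GET throughput of the kind reported above can be approximated with a simple multi-threaded S3 reader. The sketch below uses boto3 against a placeholder endpoint and assumes the 1 MB objects were pre-loaded; it is a measurement sketch, not the actual s3-benchmark tool [2].

    import time
    from concurrent.futures import ThreadPoolExecutor
    import boto3

    # Placeholder endpoint/credentials; assumes bucket "bench" holds
    # pre-loaded 1 MB objects named obj-000000 ... obj-009999.
    s3 = boto3.client("s3", endpoint_url="http://dss-minio:9000",
                      aws_access_key_id="minio",
                      aws_secret_access_key="minio123")

    def get_one(key: str) -> int:
        return len(s3.get_object(Bucket="bench", Key=key)["Body"].read())

    keys = [f"obj-{i:06d}" for i in range(10_000)]
    start = time.time()
    with ThreadPoolExecutor(max_workers=64) as pool:
        total_bytes = sum(pool.map(get_one, keys))   # fan out GETs
    elapsed = time.time() - start
    print(f"GET throughput: {total_bytes / elapsed / 1e9:.2f} GB/sec")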
IV. CONCLUSIONS & FUTURE WORK
In this paper, we introduced DSS, our new object storage system, which deploys object-storage key-value APIs on NVMe-oF SSDs. DSS is a disaggregated storage system that features deterministic, high I/O bandwidth and scalability for object storage. We measured DSS read and write throughput, which reach the order of hundreds and tens of GB/sec, respectively. In the future, we will improve our storage system's performance further by enabling the S3 service over RDMA to eliminate HTTP/TCP copy overhead. We also plan to offload the S3 Select API to the FPGA unit on smart SSDs and to filter data before transferring it to the client, so that clients retrieve only the data they need. In addition, we will complete our provisioning technique for delivering consistent bandwidth to each client session based on its Quality of Service (QoS) requirements.
REFERENCES
[1] “MinIO object storage cluster,” https://min.io/.
[2] “S3-benchmark,” https://github.com/wasabi-tech/s3-benchmark.