The process of DNA-based data storage (DNA storage for short) can be modelled mathematically as a communication channel, termed the DNA storage channel, whose inputs and outputs are sets of unordered sequences. To design error-correcting codes for the DNA storage channel, a new metric, termed the sequence-subset distance, is introduced. It generalizes the Hamming distance to a distance function defined between any two sets of unordered vectors and helps to establish a uniform framework for designing error-correcting codes for the DNA storage channel. We further introduce a family of error-correcting codes for DNA storage, referred to as sequence-subset codes, and show that the error-correcting ability of such codes is completely determined by their minimum distance. We derive several upper bounds on the size of sequence-subset codes, including a tight bound for a special case, a Singleton-like bound, and a Plotkin-like bound. We also propose several constructions, including an optimal construction for that special case, which imply lower bounds on the size of such codes.
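
To make the idea of a Hamming-type distance between two unordered sets of sequences concrete, the following is a minimal sketch under an assumed definition, since the formula is not stated in the text above. It assumes that the smaller set is mapped injectively into the larger one, that matched pairs cost their Hamming distance, that each unmatched sequence costs its full length, and that the distance is the minimum total cost over all such mappings. The names `hamming` and `set_distance` are hypothetical, and the brute-force search is intended only for very small sets.

```python
from itertools import permutations


def hamming(x, y):
    """Hamming distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(x, y))


def set_distance(X1, X2):
    """Distance between two sets of equal-length sequences.

    ASSUMED definition (not taken from the text): minimum, over all
    injections from the smaller set into the larger one, of the summed
    Hamming distances of matched pairs, plus the sequence length L for
    every sequence of the larger set that is left unmatched.
    """
    # Order the two sets so that `small` has at most as many elements as `large`.
    small, large = sorted((list(set(X1)), list(set(X2))), key=len)
    if not large:
        return 0  # both sets are empty
    L = len(large[0])
    unmatched_cost = L * (len(large) - len(small))
    # Brute force: every ordered selection of len(small) elements of `large`
    # corresponds to one injection small -> large.
    best_matched = min(
        sum(hamming(x, y) for x, y in zip(small, image))
        for image in permutations(large, len(small))
    )
    return best_matched + unmatched_cost


print(set_distance({"ACGT"}, {"ACGA"}))                  # 1
print(set_distance({"ACGT", "TTTT"}, {"TTTT", "ACGT"}))  # 0
print(set_distance({"AAAA"}, {"AAAA", "CCCC"}))          # 4
print(set_distance({"AAAA", "CCCC"}, {"AAAC", "CCCA"}))  # 2
```

When both sets contain a single sequence, this sketch reduces to the ordinary Hamming distance, which is the sense in which such a function generalizes it. For larger sets, the exhaustive search could be replaced by a minimum-cost bipartite matching algorithm.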