Detection of Noisy and Corrupted Data Using
Clustering Techniques
Kui Cai
Singapore University of Technology and Design (SUTD)
8 Somapah Rd, 487372, Singapore
Email: cai kui@sutd.edu.sg
Kees Schouhamer Immink
Turing Machines Inc.
3016 DK Rotterdam, The Netherlands
Email: immink@turing-machines.com
Abstract—We investigate machine learning based on clustering
techniques that are suitable for the detection of n-symbol words
of q-ary symbols transmitted over a noisy channel with partially
unknown characteristics. We consider the detection of the n-
symbol q-ary data as a classification problem, where objects are
recognized from a corrupted vector, which is obtained by an
unknown corruption process.
I. INTRODUCTION
In non-volatile memories, the reading of stored data is typ-
ically done through the use of predetermined fixed thresholds.
Due to process tolerances and/or drift over time of the written
physical features, however, fixed threshold usage may result
in a significant reduction of the error performance of the
memory [1].
We present new techniques for the detection of q-ary data
in the face of additive noise and unknown channel corruption
by drift. The new detection techniques are based on the
teachings of cluster analysis. Assume an $n$-symbol $q$-ary word $(x_1, \ldots, x_n)$, $x_i \in \{0, \ldots, q-1\}$, is transmitted or stored, and that the received word $(r_1, \ldots, r_n)$ is corrupted by additive noise, intersymbol interference, and other unknown nuisances. Retrieving a replica of the original $q$-ary data is seen as the classification function $(r_1, \ldots, r_n) \to \{0, \ldots, q-1\}$.
Machine learning and deep learning are techniques that are
very suitable for classification tasks. The detection function
will be considered here as a classification problem, or object
recognition, which will be targeted by cluster analysis. Cluster
analysis is an example of unsupervised machine learning, a common technique for statistical data analysis that is used in many fields, including pattern recognition, image analysis, information retrieval, data compression, and computer graphics [3].
We investigate a typical competitive learning algorithm,
named k-means clustering technique, which is an iterative
process that implements the detection function given initial
values of some basic parameters. The aim of the learning algorithm is to map the $n$ received symbols into $k$ clusters, where in the case at hand the $k$ clusters are associated with the $q$ symbol values. The detector is ignorant of the number of different symbol values in the sent word, so that $k \le q$. A
major challenge in cluster analysis is the estimation of the optimal number of 'clusters' [4], [5]. The k-means clustering technique does not allow one to easily estimate the number of (different) clusters, and therefore other means are needed to estimate the number of clusters, $k$.

(This work is supported by Singapore Ministry of Education Academic Research Fund Tier 2 MOE2016-T2-2-054, SUTD-ZJU grant ZJURP1500102, and SUTD SRG grant SRLS15095.)
In mass data storage devices, the user data are translated
into physical features that can be either electronic, magnetic,
optical, or of other nature [6]. Due to process variations,
the magnitude of the physical effect may deviate from the
nominal values, which may affect the reliable read-out of the
data. We may distinguish between two stochastic effects that
determine the process variations. On the one hand, we have the
unpredictable stochastic process variations, and on the other
hand, we may observe long-term effects, also stochastic, due
to various physical effects. The probability distribution of the
recorded features changes over time, and specifically the mean
and the variance of the probability distribution may change.
The deviations from the nominal means, called offsets, can be
estimated using an aging model, but, clearly, the offsets depend
on unpredictable parameters such as temperature, humidity,
etc, so that the model-based prediction is inaccurate.
The usage of the Pearson distance in lieu of the traditional
Euclidean distance has been advocated for channels with
unknown (mismatched) gain and offsets [1], [7], where the
authors assume that the offset is constant (uniform) for all
symbols in the word. A drawback of the Pearson distance-based method is that the required number of operations grows exponentially with the word length, $n$, and alphabet size, $q$, so that for larger values of $n$ and $q$ the method becomes impractical.
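To make the comparison concrete, the distance measure of [1], [7] can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the function name `pearson_distance` is ours.

```python
import numpy as np

def pearson_distance(r, c):
    """Pearson distance 1 - rho(r, c) between the received vector r and a
    candidate codeword c. Unlike the Euclidean distance, it is invariant
    to an unknown gain and a uniform offset applied to r."""
    rho = np.corrcoef(r, c)[0, 1]
    return 1.0 - rho

# Gain 2 and offset 5 leave the Pearson distance (essentially) zero.
c = np.array([0.0, 1.0, 2.0, 3.0])
d = pearson_distance(2.0 * c + 5.0, c)
```

Minimum Pearson distance detection evaluates this distance against every candidate codeword, which is the source of the exponential complexity noted above.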
In this paper, we investigate detection schemes of q-ary
words that are based on the results of modern cluster analysis.
We assume that the received symbols are distorted by additive noise, and we further assume that the channel characteristic is not completely known to sender and receiver. Detection is based on the observation of a single word of $n$ symbols only; the observation of past or future symbols is not assumed.
Cluster analysis is a well-known technique, but its application
to q-ary data detection in conjunction with a mismatched
channel has, to the best of the authors’ knowledge, not been
reported in the literature.
We start in Section II with preliminaries and a description
of the mismatched channel model. Fixed threshold detection
is presented in Section III. In Section IV, we present the
outcomes of our first experiments with a dynamic threshold
detection scheme based on k-means clustering. Computer
simulations are conducted to assess the error performance of
the prior art and new schemes developed. Section V concludes
this paper.
II. PRELIMINARIES AND CHANNEL MODEL

We consider a communication codebook, $\mathcal{S} \subseteq \mathcal{Q}^n$, of $n$-symbol words $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ over the $q$-ary alphabet $\mathcal{Q} = \{0, \ldots, q-1\}$, where $n$, the length of $\mathbf{x}$, is a positive integer. The word, $\mathbf{x} \in \mathcal{S}$, is translated into physical features, where logical symbol $i$ is written at an average (physical) level $i + b_i$, where $b_i \in \mathbb{R}$, $0 \le i \le q-1$. The average deviations, or 'offsets', denoted by $b_i$, are unknown to the detector, and may slowly vary (drift) in time due to charge leakage or temperature change. We assume that the offsets $b_i$ are relatively small with respect to the assumed, unity, difference (or amplitude) between two adjacent physical signal levels. It is further assumed that the offsets, $b_i$, may be different for each received word, but do not vary within a word. For unambiguous detection, the average of the physical levels associated with the logical symbol $i$ is assumed to be less than that associated with the logical symbol $i+1$. In other words, we have the premise

$$b_0 < 1 + b_1 < 2 + b_2 < \cdots < q - 1 + b_{q-1} \quad (1)$$

or

$$b_{i-1} - b_i < 1, \quad 1 \le i \le q-1. \quad (2)$$
Assume a word, $\mathbf{x}$, is sent. The symbols, $r_i$, of the retrieved vector $\mathbf{r} = (r_1, \ldots, r_n)$ are distorted by additive noise and given by

$$r_i = x_i + b_{x_i} + \nu_i. \quad (3)$$

We analyze the above case where the unknown offsets, $b_i$'s, are uncorrelated. We assume that the received vector, $\mathbf{r}$, is corrupted by additive Gaussian noise $\boldsymbol{\nu} = (\nu_1, \ldots, \nu_n)$, where $\nu_i \in \mathbb{R}$ are zero-mean independent and identically distributed (i.i.d.) noise samples with normal distribution $\mathcal{N}(0, \sigma^2)$. The quantity $\sigma^2 \in \mathbb{R}$ denotes the additive noise variance.
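As an illustration, channel model (3) can be simulated as follows. The function and parameter names are ours, and the zero-mean uniform offset distribution anticipates the simulation setup of Section IV-C; in general the offsets are simply unknown.

```python
import numpy as np

rng = np.random.default_rng(7)

def transmit(x, q, sigma, sigma_b):
    """Channel model of Eq. (3): r_i = x_i + b_{x_i} + nu_i.

    One offset b_i per symbol level is drawn per word from a zero-mean
    uniform distribution with standard deviation sigma_b, i.e. on
    [-sqrt(3)*sigma_b, sqrt(3)*sigma_b]; the noise is N(0, sigma^2).
    """
    half = np.sqrt(3.0) * sigma_b
    b = rng.uniform(-half, half, size=q)      # level offsets b_0..b_{q-1}
    nu = rng.normal(0.0, sigma, size=len(x))  # i.i.d. Gaussian noise
    return x + b[x] + nu

x = rng.integers(0, 4, size=64)               # random 4-ary word, n = 64
r = transmit(x, q=4, sigma=0.05, sigma_b=0.1)
```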
III. FIXED THRESHOLD DETECTION (FTD)

The symbols of the received word, $r_i$, can be straightforwardly quantized to an integer, $\hat{x}_i$, with a conventional fixed threshold detector (FTD), also called symbol-by-symbol detector. The threshold function is denoted by $\hat{x}_i = \Phi_{\vartheta}(r_i)$, $\hat{x}_i \in \mathcal{Q}$, where the threshold vector $\vartheta = (\vartheta_0, \ldots, \vartheta_{q-2})$ has $q-1$ (real) elements, called thresholds. The threshold vector satisfies the order

$$\vartheta_0 < \vartheta_1 < \vartheta_2 < \cdots < \vartheta_{q-2}. \quad (4)$$

The quantization function, $\Phi_{\vartheta}(u)$, of the threshold detector is defined by

$$\Phi_{\vartheta}(u) = \begin{cases} 0, & u < \vartheta_0, \\ i, & \vartheta_{i-1} \le u < \vartheta_i, \ 1 \le i \le q-2, \\ q-1, & u \ge \vartheta_{q-2}. \end{cases} \quad (5)$$

For a fixed threshold detector the $q-1$ detection threshold values, $\vartheta_i$, are equidistant at the levels

$$\vartheta_i = \frac{1}{2} + i, \quad 0 \le i \le q-2. \quad (6)$$
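With the equidistant thresholds of (6), the quantization (5) reduces to rounding to the nearest integer followed by clipping to the alphabet. A minimal sketch (the function name `ftd` is ours):

```python
import numpy as np

def ftd(r, q):
    """Fixed threshold detection, Eqs. (5)-(6): round each received
    symbol to the nearest integer and clip to {0, ..., q-1}.
    (np.rint rounds exact half-integers to even, which differs from (5)
    only on the measure-zero threshold points themselves.)"""
    return np.clip(np.rint(r).astype(int), 0, q - 1)
```

For example, with $q = 4$ the received values 0.2, 1.6, 3.4 are quantized to 0, 2, 3, and values outside the signal range are clipped to the extreme symbols.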
Threshold detection is very attractive for its implementation
simplicity. However, Immink & Weber [1] showed that the
error performance seriously degrades in the face of channel
mismatch. Dynamically adjusting the thresholds is an alter-
native that offers solace in the face of channel mismatch. In
the next section, we investigate machine learning based on
clustering techniques that are suitable for accomplishing the
detection within a more practical setting.
IV. DATA DETECTION USING k-MEANS CLUSTERING

A. Basics of k-means clustering

The k-means clustering technique aims to partition the $n$ received symbols into $k$ sets $V = \{V_0, V_1, \ldots, V_{k-1}\}$ so as to minimize the within-cluster sum of squares defined by

$$\arg\min_{V} \sum_{i=0}^{k-1} \sum_{r_j \in V_i} (r_j - \mu_i)^2, \quad (7)$$

where the centroid $\mu_i$ is the mean of the received symbols in $V_i$, or

$$\mu_i = \frac{1}{|V_i|} \sum_{r_j \in V_i} r_j. \quad (8)$$
The problem of choosing the correct number of clusters is hard, and numerous prior art publications are available to facilitate this choice [4], [5]. Here we assume that a cluster is associated with one of the $k$ symbol values, that is, $k = q$.

The k-means clustering algorithm is an iterative process that finds (or tries to find) a solution of (7). The initial sets $V_i^{(1)}$, $0 \le i \le k-1$, are empty. The superscript integer in parentheses, $(t)$, denotes the iteration tally. We initialize the $k$ centroids $\mu_i^{(1)}$, $0 \le i \le k-1$, by a reasonable choice. For example, Forgy's method [8] randomly chooses $k$ symbols (assuming $k < n$), $r_i$, from the received vector $\mathbf{r}$, and uses these as the initial centroids $\mu_i^{(1)}$. The choice of the initial centroids has a significant bearing on the error performance of the clustering detection technique. We do not follow Forgy's approach, and try, dependent on the specific situation at hand, to develop more suitable initial centroids $\mu_i^{(1)}$'s. We assume that we order the centroids such that

$$\mu_0^{(t)} < \mu_1^{(t)} < \cdots < \mu_{q-1}^{(t)}. \quad (9)$$
After the initialization step, we iterate the next two steps until the symbol assignments no longer change.

Assignment step: Assign the $n$ received symbols, $r_i$, to the $k$ sets $V_j^{(t+1)}$. If $r_i$, $1 \le i \le n$, is closest to $\mu_\ell^{(t)}$, or

$$\ell = \arg\min_{j} \left( r_i - \mu_j^{(t)} \right)^2, \quad (10)$$

then $r_i$ is assigned to $V_\ell^{(t+1)}$. The (temporary) word, denoted by

$$\hat{\mathbf{x}}^{(t)} = (\hat{x}_1^{(t)}, \ldots, \hat{x}_n^{(t)}), \quad (11)$$

is found by

$$\hat{x}_i^{(t)} = \varphi_{V^{(t)}}(r_i), \quad 1 \le i \le n, \quad (12)$$

where $\varphi_{V^{(t)}}(r_i) = j$ such that $r_i \in V_j^{(t)}$.

Updating step: Compute updated versions of the $k$ means $\mu_j^{(t+1)}$, $j \in \mathcal{Q}$. An update of the new means $\mu_j^{(t+1)}$ is found by

$$\mu_j^{(t+1)} = \frac{1}{|V_j^{(t+1)}|} \sum_{r_i \in V_j^{(t+1)}} r_i, \quad j \in \mathcal{Q}, \quad (13)$$

where it is understood that if $|V_j^{(t+1)}| = 0$ then $\mu_j^{(t+1)} = \mu_j^{(t)}$ (that is, no update).

After running the above routine until the temporary word is unchanged, say at iteration step $t = t_o$, we have $\hat{\mathbf{x}}^{(t_o - 1)} = \hat{\mathbf{x}}^{(t_o)}$. Then we have found the final estimate of the sent word, $\mathbf{x}_o = \hat{\mathbf{x}}^{(t_o)}$. Bottou [9] showed that the k-means cluster algorithm always converges to a simple steady state, and limit cycles do not occur. It is possible, however, that the process reaches a local minimum of the within-cluster sum of squares (7).
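The complete iteration can be sketched as follows, with the centroids initialized at the nominal levels $\mu_i^{(1)} = i$ (as in the simulations of Section IV-C) rather than by Forgy's method; the function and variable names are ours.

```python
import numpy as np

def kmeans_detect(r, q, max_iter=50):
    """Detect a q-ary word from the received vector r by k-means
    clustering with k = q clusters.

    Centroids start at the nominal levels 0, 1, ..., q-1; with this
    initialization they typically stay sorted, as assumed in Eq. (9).
    """
    mu = np.arange(q, dtype=float)  # initial centroids mu_i^(1) = i
    x_hat = None
    for _ in range(max_iter):
        # Assignment step, Eq. (10): nearest centroid per symbol.
        assign = np.argmin((r[:, None] - mu[None, :]) ** 2, axis=1)
        if x_hat is not None and np.array_equal(assign, x_hat):
            break                   # temporary word unchanged: converged
        x_hat = assign
        # Updating step, Eq. (13): cluster means; empty clusters keep mu.
        for j in range(q):
            members = r[assign == j]
            if members.size > 0:
                mu[j] = members.mean()
    return x_hat
```

For instance, a received 4-ary vector whose symbols are shifted by small per-level offsets is still detected correctly, because the centroids drift toward the actual cluster means.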
B. Assignment step: relation with threshold detection

We take a closer look at the assignment step of the k-means clustering technique, given by (10). Considering the order (9) of the centroids $\mu_j^{(t)}$, we simply infer that the symbol $r_i$ lies between, say, $\mu_u^{(t)} \le r_i \le \mu_{u+1}^{(t)}$, $0 \le u \le q-2$. Thus

$$\ell = \arg\min_{j \in \{u, u+1\}} \left( r_i - \mu_j^{(t)} \right)^2.$$

As

$$\left( r_i - \mu_u^{(t)} \right)^2 - \left( r_i - \mu_{u+1}^{(t)} \right)^2 = 2 r_i \left( \mu_{u+1}^{(t)} - \mu_u^{(t)} \right) + \left( \mu_u^{(t)} \right)^2 - \left( \mu_{u+1}^{(t)} \right)^2 = \left( \mu_{u+1}^{(t)} - \mu_u^{(t)} \right) \left( 2 r_i - \mu_u^{(t)} - \mu_{u+1}^{(t)} \right), \quad (14)$$

we obtain

$$\ell = \begin{cases} u, & r_i < \dfrac{\mu_{u+1}^{(t)} + \mu_u^{(t)}}{2}, \\ u + 1, & r_i > \dfrac{\mu_{u+1}^{(t)} + \mu_u^{(t)}}{2}. \end{cases} \quad (15)$$

Using (5), we yield

$$\ell = \arg\min_{j} \left( r_i - \mu_j^{(t)} \right)^2 = \Phi_{\hat{\vartheta}}(r_i), \quad (16)$$

where the threshold vector, $\hat{\vartheta}$, is given by

$$\hat{\vartheta}_i = \frac{\mu_{i+1}^{(t)} + \mu_i^{(t)}}{2}, \quad 0 \le i \le q-2, \quad (17)$$

and the intermediate vector, $\hat{\mathbf{x}}^{(t)}$, is given by

$$\hat{x}_i^{(t)} = \Phi_{\hat{\vartheta}}(r_i), \quad 1 \le i \le n. \quad (18)$$
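The equivalence (16)-(17) is easy to check numerically: assigning each symbol to the nearest centroid gives the same result as thresholding at the midpoints between adjacent centroids. A small sketch (function names are ours; the centroids are assumed sorted as in (9)):

```python
import numpy as np

def kmeans_assign(r, mu):
    """Assignment step, Eq. (10): index of the nearest centroid."""
    return np.argmin((r[:, None] - mu[None, :]) ** 2, axis=1)

def threshold_assign(r, mu):
    """Equivalent dynamic threshold detector, Eqs. (16)-(17): the
    thresholds are the midpoints between adjacent sorted centroids."""
    theta = (mu[1:] + mu[:-1]) / 2.0
    return np.searchsorted(theta, r)   # realizes Phi_theta of Eq. (5)

mu = np.array([0.12, 1.03, 2.2, 3.0])      # sorted centroids, Eq. (9)
r = np.array([0.4, 0.7, 1.9, 2.7, 3.5])    # received symbols
```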
We conclude that the k-means cluster detection method is a dynamic threshold detector, where at each update the threshold vector, $\hat{\vartheta}$, is updated with the means of the members of each cluster using (13).

TABLE I
HISTOGRAM OF THE NUMBER OF ITERATIONS (PERCENT OF WORDS) FOR $q = 4$, $n = 64$, AND $\sigma_b = 0.1$.

$t_o$ | SNR = 17 dB | SNR = 20 dB
  1   |    91.43    |    99.70
  2   |     8.37    |     0.30
  3   |     0.20    |     0
In the next section, we report on the outcomes of computer
simulations using channel model (3).
C. Results of simulations

We study the channel model (3), where we assume that the stochastic deviations from the means, $b_i$, $i \in \mathcal{Q}$, are taken from a zero-mean continuous uniform distribution with variance $\sigma_b^2$. Thus, the $b_i$'s lie within the range $-\sqrt{3}\sigma_b \le b_i \le \sqrt{3}\sigma_b$. Given premise (2), we have $\sigma_b < \frac{1}{2\sqrt{3}}$. We simply initialize the centroids to the standard values, that is, by $\mu_i^{(1)} = i$, $i \in \mathcal{Q}$, and iterate the assignment and updating steps as outlined in the above algorithm. Figure 1 shows outcomes of computer simulations, where we compare the word error rate (WER), that is, the probability that a word is received in error, of fixed threshold detection and detection based on k-means clustering classification versus the signal-to-noise ratio (SNR) defined by $\mathrm{SNR} = -20 \log_{10} \sigma$. We plotted two different cases, namely $\sigma_b = 0$ (ideal channel) and $\sigma_b = 0.1$. As a further comparison, we plotted the upper bound of the word error rate of a threshold detector for an ideal channel, given by [1]

$$\mathrm{WER} < \frac{2(q-1)}{q}\, n\, Q\!\left( \frac{1}{2\sigma} \right), \quad (19)$$

where $Q(\cdot)$ is the Gaussian Q-function. We infer that in case the channel is ideal, $\sigma_b = 0$, the error performance of k-means clustering detection is close to the performance, in both theory and simulation, of conventional fixed threshold detection. In case the channel is not ideal, here $\sigma_b = 0.1$, k-means clustering detection is superior to threshold detection.
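The simulation setup and the bound (19) can be sketched as follows. The code is our own illustration, not the authors' simulator: `simulate_wer` estimates the WER by Monte Carlo under channel model (3), and the Q-function is evaluated via the complementary error function.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def wer_bound(q, n, sigma):
    """Upper bound (19): WER < 2(q-1)/q * n * Q(1/(2*sigma)),
    where Q(x) = 0.5*erfc(x/sqrt(2))."""
    Qf = 0.5 * math.erfc((1.0 / (2.0 * sigma)) / math.sqrt(2.0))
    return 2.0 * (q - 1) / q * n * Qf

def ftd(r, q):
    # Fixed threshold detection with the equidistant thresholds of (6).
    return np.clip(np.rint(r).astype(int), 0, q - 1)

def simulate_wer(detector, q, n, sigma, sigma_b, words=2000):
    """Monte Carlo word error rate under channel model (3), with
    per-level uniform offsets of standard deviation sigma_b."""
    half = math.sqrt(3.0) * sigma_b
    errors = 0
    for _ in range(words):
        x = rng.integers(0, q, size=n)
        b = rng.uniform(-half, half, size=q)
        r = x + b[x] + rng.normal(0.0, sigma, size=n)
        errors += not np.array_equal(detector(r, q), x)
    return errors / words

sigma = 10.0 ** (-17.0 / 20.0)   # SNR = 17 dB, i.e. sigma ~ 0.141
wer = simulate_wer(ftd, q=4, n=64, sigma=sigma, sigma_b=0.0)
bound = wer_bound(4, 64, sigma)
```

On the ideal channel the simulated FTD word error rate should fall at or below the union bound (19), in line with curves (a) and (c) of Figure 1.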
The number of iterations, which is an important (time) complexity issue, depends on $q$, $n$, and the signal-to-noise ratio, SNR. The convergence of the iteration process is guaranteed [9], but the speed of convergence is an open question that we tackled by computer simulations. Table I shows some results of simulations for the case $q = 4$, $n = 64$, and $\sigma_b = 0.1$, as depicted in Figure 1. At an SNR of 17 dB, around 91 percent of the received words are detected without further iterations. For 8 percent of the detected words, only one iteration of the threshold levels is needed. At an SNR of 20 dB, we found that almost no iterations are required at all. Thus, since in the large majority of cases no iterations are needed, we conclude that, at the cost of a slight additional (time) complexity, the proposed k-means clustering classification outperforms fixed threshold detection.
Fig. 1. Word error rate (WER) of conventional fixed threshold detection (FTD), curve (a'), and k-means clustering detection, curve (b'), versus $\mathrm{SNR} = -20 \log_{10} \sigma$ (dB) for $n = 64$, $q = 4$, and $\sigma_b = 0.1$. Similar curves, (a) and (b), are shown for the case $\sigma_b = 0$ (ideal channel). The upper bound to the word error rate of a fixed threshold detector (FTD) for the ideal channel, $\sigma_b = 0$, given by (19), is shown as curve (c), for $q = 4$ and $n = 64$.
V. CONCLUSIONS

We have proposed and analyzed machine learning based on a k-means clustering technique as a detection method of words of $q$-ary symbols. We have analyzed the detection of distorted data retrieved from a data storage medium where user data is stored as physical features with $q$ different levels. Due to manufacturing tolerances and/or ageing, the $q$ levels may differ from the desired, nominal, ones. Results of simulations have been presented, where the $q$ unknown level offsets are independent stochastic variables with a uniform probability distribution.
REFERENCES

[1] K. A. S. Immink and J. H. Weber, "Minimum Pearson Distance Detection for Multi-Level Channels with Gain and/or Offset Mismatch," IEEE Trans. Inform. Theory, vol. IT-60, pp. 5966-5974, Oct. 2014.
[2] K. Cai and K. A. S. Immink, "Cascaded Channel Model, Analysis, and Hybrid Decoding for Spin-Torque Transfer Magnetic Random Access Memory (STT-MRAM)," IEEE Trans. Magn., vol. MAG-53, pp. 1-11, Nov. 2017.
[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
[4] R. Tibshirani, G. Walther, and T. Hastie, "Estimating the number of clusters in a data set via the gap statistic," Journal of the Royal Statistical Society, vol. 63, pp. 411-423, 2001.
[5] A. D. Gordon, Classification, 2nd edn., London: Chapman and Hall/CRC, 1999.
[6] K. A. S. Immink, "A Survey of Codes for Optical Disk Recording," IEEE J. Select. Areas Commun., vol. 19, no. 4, pp. 756-764, April 2001.
[7] K. A. S. Immink, K. Cai, and J. H. Weber, "Dynamic Threshold Detection Based on Pearson Distance Detection," IEEE Trans. Commun., 2018.
[8] E. W. Forgy, "Cluster analysis of multivariate data: efficiency versus interpretability of classifications," Biometrics, vol. 21, pp. 768-769, 1965.
[9] L. Bottou and Y. Bengio, "Convergence properties of the k-means algorithms," Advances in Neural Information Processing Systems 7, pp. 585-592, MIT Press, 1995.