
28 Current Bioinformatics, 2009, 4, 28-40

1574-8936/09 $55.00+.00 © 2009 Bentham Science Publishers Ltd.

Digital Signal Processing in the Analysis of Genomic Sequences

Juan V. Lorenzo-Ginori*,1, Aníbal Rodríguez-Fuentes1, Ricardo Grau Ábalo2 and Robersy Sánchez

Rodríguez3

1Centro de Estudios de Electrónica y Tecnologías de la Información, Facultad de Ingeniería Eléctrica, Universidad Central “Marta Abreu” de Las Villas, Carretera a Camajuaní Km. 5 ½, 54830 Santa Clara, Villa Clara, Cuba; 2Centro de Estudios de Informática, Facultad de Ingeniería Eléctrica, Universidad Central “Marta Abreu” de Las Villas, Carretera a Camajuaní Km. 5 ½, 54830 Santa Clara, Villa Clara, Cuba; 3Instituto Nacional de Investigaciones en Viandas Tropicales (INIVIT), Biotechnology Group, Santo Domingo, Villa Clara, Cuba

Abstract: Digital Signal Processing (DSP) applications in Bioinformatics have received great attention in recent years,

where new effective methods for genomic sequence analysis, such as the detection of coding regions, have been devel-

oped. The use of DSP principles to analyze genomic sequences requires defining an adequate representation of the nucleo-

tide bases by numerical values, converting the nucleotide sequences into time series. Once this has been done, all the

mathematical tools usually employed in DSP are used in solving tasks such as identification of protein coding DNA re-

gions, identification of reading frames, and others. In this article we present an overview of the most relevant applications

of DSP algorithms in the analysis of genomic sequences, showing the main results obtained by using these techniques,

analyzing their relative advantages and drawbacks, and providing relevant examples. We finally analyze some perspec-

tives of DSP in Bioinformatics, considering recent research results on algebraic structures of the genetic code, which sug-

gest other new DSP applications in this field, as well as the new field of Genomic Signal Processing.

Keywords: Digital Signal Processing, genomic sequences, coding regions.

INTRODUCTION

Digital Signal Processing (DSP) is an area of science and

engineering that has developed during the past 40 years as a

result of the constant evolution of computer science and

technology. DSP comprises the representation, transformation and manipulation of digital signals as well as the information associated with them. In this context, signals are

usually physical magnitudes that vary in time or space, and

digital signals are those represented as sequences of num-

bers, as in the case of time series.

The discipline of DSP uses a set of mathematical tools to analyze and process signals, among them the Discrete Fourier Transform, the Z transform, digital filters, parametric models, the Wavelet Transform, correlation functions and others. When considering the informational

content of signals, other concepts from Information Theory

such as entropy and mutual information are also used.

A key concept in DSP is the possibility of representing signals in the frequency domain by means of the Discrete Fourier Transform. This representation reveals some important signal properties that are not apparent in the time domain and that are associated with the frequency spectrum of the signal.

In the case of the genomic sequences, these have been

represented mathematically by character strings of symbols

from a size-4 alphabet consisting of the letters A, T, G and

C, which represent each one of the nucleotide bases. In the

case of proteins, the alphabet size is 20, corresponding to the possible amino acids. The possibility of applying DSP techniques widely to the analysis of genomic sequences arises when these are converted appropriately into numerical sequences, for which several rules have been developed. Notice that genomic signals do not have time or space as the independent variable, as occurs with most physical signals.

*Address correspondence to this author at the Centro de Estudios de Electrónica y Tecnologías de la Información, Facultad de Ingeniería Eléctrica, Universidad Central “Marta Abreu” de Las Villas, Carretera a Camajuaní Km. 5 ½, 54830 Santa Clara, Villa Clara, Cuba; E-mail: juanl@uclv.edu.cu

This paper is organized in the following way. Firstly an

overview of the main DSP algorithms used in applications to

genomic sequence analysis is shown: digital filters, the Dis-

crete Fourier Transform (DFT), the Short-Time Fourier

Transform (STFT), parametric models (AR, MA, ARMA),

Wavelet Transform and the Information Theory concept of

entropy. Hidden Markov Models can be considered also as a

DSP tool, but this topic will not be covered, as there is a re-

cent comprehensive review article by De Fonzo et al. [1].

Then the numerical representation of genomic sequences is

presented. This allows the application of DSP tools to study

genomic sequences. After this, the major applications of DSP to the analysis of genomic sequences are reviewed, such as identification of protein coding DNA regions,

identification of reading frames, location of splice sites and

others. We finally review the perspectives of DSP in this

field, considering recent research results on algebraic struc-

tures of the genetic code and the new field of Genomic Sig-

nal Processing.

MAIN DSP ALGORITHMS EMPLOYED IN THE

ANALYSIS OF GENOMIC SEQUENCES

In this section a brief overview of the main DSP algorithms that have been used in the analysis of genomic sequences is presented. For a full treatment of DSP theory, see the books by Oppenheim and Schafer [2] and Proakis and Manolakis [3].


A) Digital Filters

A digital filter is a particular class of discrete system that applies a transformation to an input discrete numerical sequence. There are different classes of digital filters according to the properties of their input-output relationships, for example linear, nonlinear, time-invariant or adaptive. The basic frequency-selective digital filters are linear and time-invariant (LTI) discrete systems.

Digital filters are characterized by numerical algorithms that can be implemented in any class of digital processor. In particular, LTI digital filters belong to one of two categories, according to the duration of their response to the unit impulse (the discrete counterpart of the Dirac delta function) when it is used as the input signal: infinite (IIR) or finite (FIR) impulse response. The input-output relationships for IIR digital filters are characterized and implemented algorithmically through a finite difference equation of the form

$$\sum_{k=0}^{N} a_k\, y[n-k] = \sum_{k=0}^{M} b_k\, x[n-k], \tag{1}$$

where x[n] and y[n] are the input and output numerical se-

quences respectively, ak and bk are numerical coefficients, n

is the sample index, and k is an integer delay with maximum

values N and M for the output and input sequences respec-

tively. On the other hand FIR digital filters are characterized

by a discrete convolution operation of the form

$$y[n] = \sum_{m=0}^{N-1} h[m]\, x[n-m] \tag{2}$$

In this equation, h[m] is the impulse response of the fil-

ter, which has a length of N samples. The bilateral Z trans-

form operator is defined as

$$Z\{x[n]\} = \sum_{n=-\infty}^{\infty} x[n]\, z^{-n} \tag{3}$$

where z is a complex variable. When this operator is applied

to equations (1) or (2), the system transfer function in the Z-

transform domain is obtained. The system transfer function

relates the input and output sequences x[n] and y[n], through

their respective Z transforms X(z) and Y(z). The transfer

function has the general form

$$H(z) = \frac{Y(z)}{X(z)} = \frac{\sum_{k=0}^{M} b_k\, z^{-k}}{\sum_{k=0}^{N} a_k\, z^{-k}} \tag{4}$$
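The difference equation (1) and the convolution (2) translate almost directly into code. The following is a minimal sketch, assuming zero initial conditions and a nonzero coefficient a_0; the function names `lti_filter` and `fir_filter` are illustrative and do not come from any library.

```python
def lti_filter(b, a, x):
    """Solve Eq. (1) for y[n], assuming zero initial conditions and a[0] != 0."""
    if a[0] == 0:
        raise ValueError("a[0] must be nonzero")
    y = []
    for n in range(len(x)):
        # Feed-forward part: sum_k b_k x[n-k]
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        # Feedback part: subtract sum_{k>=1} a_k y[n-k]
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y


def fir_filter(h, x):
    """Eq. (2): convolution with a finite impulse response h (no feedback)."""
    return lti_filter(h, [1.0], x)


impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
iir_response = lti_filter([1.0], [1.0, -0.5], impulse)  # y[n] = 0.5*y[n-1] + x[n]
fir_response = fir_filter([0.25, 0.5, 0.25], impulse)
print(iir_response)  # [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
print(fir_response)  # [0.25, 0.5, 0.25, 0.0, 0.0, 0.0]
```

The impulse response of the recursive example keeps decaying without ever reaching zero (IIR), whereas the non-recursive one vanishes after three samples (FIR).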

The transfer function H(z) for this class of systems is a ratio of polynomials in the complex variable z and has an associated region of convergence, which is closely related to the positions of its poles in the complex Z plane. A property of LTI systems is that the complex exponential sequences of the form

$$x[n] = e^{i\omega n},$$

where $i$ is the imaginary unit and $\omega$ is the angular frequency, are eigenfunctions of these systems, and this leads to the concept that these systems have an associated frequency response, which can be obtained by setting $z = e^{i\omega}$ in equation (4), i.e.

$$H(e^{i\omega}) = H(z)\big|_{z=e^{i\omega}} \tag{5}$$

The presence of the imaginary unit in the exponent implies that $H(e^{i\omega})$ is a complex function in the frequency domain, which is usually expressed as a magnitude response together with a phase (or angle) response. The frequency response is periodic in $\omega$ (emphasizing this periodicity is the reason for using $e^{i\omega}$, instead of simply $\omega$, as the argument of H), and it is usually plotted for its values in the main interval $-\pi \le \omega < \pi$. An example of a sharp resonance peak in the magnitude response of an IIR filter is shown in Fig. (1), together with the corresponding phase response. The sharp magnitude peak means a high selectivity in frequency. The phase response of this filter is highly nonlinear (lower graph) and this nonlinearity tends to produce strong signal distortion.

Fig. (1). Frequency response in magnitude and phase of an IIR

system exhibiting a sharp peak in the magnitude response.
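A frequency response of the kind shown in Fig. (1) can be computed by evaluating the transfer function of equation (4) on the unit circle, as in equation (5). The sketch below uses a hypothetical two-pole resonator (pole radius `r` and pole angle `omega0` are illustrative values, not the filter used for the figure).

```python
import cmath
import math


def freq_response(b, a, omega):
    """Evaluate H(z) of Eq. (4) at z = exp(i*omega), as in Eq. (5)."""
    z_inv = cmath.exp(-1j * omega)
    num = sum(bk * z_inv ** k for k, bk in enumerate(b))
    den = sum(ak * z_inv ** k for k, ak in enumerate(a))
    return num / den


# Hypothetical resonator: complex-conjugate poles at r*exp(+/- i*omega0)
r, omega0 = 0.95, math.pi / 4
b = [1.0]
a = [1.0, -2.0 * r * math.cos(omega0), r * r]

# Sample the main interval -pi <= omega < pi
grid = [-math.pi + 2.0 * math.pi * i / 512 for i in range(512)]
magnitude = [abs(freq_response(b, a, w)) for w in grid]
phase = [cmath.phase(freq_response(b, a, w)) for w in grid]

peak_omega = grid[max(range(len(grid)), key=lambda i: magnitude[i])]
print(abs(peak_omega))  # close to omega0 for poles near the unit circle
```

Because the coefficients are real, the magnitude is an even function of the frequency, so the peak appears at both positive and negative frequencies.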

A variety of digital filter design techniques make it possible to obtain a desired magnitude response with frequency selectivity properties, whereas it is desired that the phase response be a linear function of $\omega$, in order to have low distortion. According to the frequency interval (band) transmitted, the magnitude of the basic ideal prototype filter frequency responses can be lowpass, highpass, bandpass or bandstop. A combination of these responses leads to a multiband filter. The typical ideal frequency responses (in magnitude) of the prototype filters are shown in Fig. (2). These ideal responses can only be approximated in practical filters, where better approximations are in general obtained by increasing the order of H(z), which means a higher computational complexity of the digital filters.

Constant magnitude response together with perfect linearity in the phase response is the condition for signal transmission without distortion through a filter in the desired frequency band. IIR digital filters have in general a nonlinear phase response, which depends on the design method employed. On the other hand, a property of FIR digital filters is that they can exhibit a perfectly linear phase response under certain conditions of symmetry in their impulse response. This has been a motivation for the use of digital FIR filters in many applications.

Fig. (2). Frequency response in magnitude for the prototype ideal

filters: lowpass, highpass, bandpass and bandstop.
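The link between impulse response symmetry and linear phase in FIR filters can be checked numerically. For a symmetric h of length N, the frequency response is a real function multiplied by exp(-iω(N-1)/2), so removing that factor should leave no imaginary part. The sketch below assumes two small illustrative impulse responses; the helper names are hypothetical.

```python
import cmath


def fir_response(h, omega):
    """Frequency response of an FIR filter with impulse response h."""
    return sum(hm * cmath.exp(-1j * omega * m) for m, hm in enumerate(h))


def max_phase_residual(h, omegas):
    """Largest |Im| left after removing the linear-phase factor exp(-i*omega*(N-1)/2)."""
    delay = (len(h) - 1) / 2.0
    return max(
        abs((fir_response(h, w) * cmath.exp(1j * w * delay)).imag) for w in omegas
    )


omegas = [0.01 * i for i in range(1, 314)]  # roughly 0 < omega < pi
symmetric = [1.0, 2.0, 3.0, 2.0, 1.0]      # h[m] == h[N-1-m]
asymmetric = [1.0, 2.0, 3.0, 4.0, 5.0]

print(max_phase_residual(symmetric, omegas))   # at rounding-error level
print(max_phase_residual(asymmetric, omegas))  # clearly nonzero
```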

B) Discrete Fourier Transform

The Discrete Fourier Transform is a mathematical operation that transforms a discrete sequence of finite length N into another sequence of the same length, according to

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-i \frac{2\pi}{N} k n}, \qquad 0 \le n, k \le N-1 \tag{6}$$

The function X[k] is the Discrete Fourier Transform

(DFT) of the sequence x[n] and constitutes the frequency

domain representation of x[n], which is usually (or

conventionally considered) a function in the time domain.

The Discrete Fourier Transform only evaluates the frequency

components required to reconstruct the finite segment of the

sequence that was analyzed. In general, the DFT is complex-valued as a result of the complex exponential on the right-hand side of equation (6); even for real sequences x[n], it is a sequence of complex numbers of the same length as x[n]. The DFT is

usually represented in terms of the corresponding magnitude

and phase functions that constitute the frequency spectrum of

the sequence x[n].
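Equation (6) can be evaluated directly, as in the following sketch. This direct form costs on the order of N² operations and is only meant to mirror the definition; in practice a fast Fourier transform routine would normally be used. The test signal, a cosine with exactly five cycles in N = 64 samples, is an illustrative choice.

```python
import cmath
import math


def dft(x):
    """Direct evaluation of Eq. (6); O(N^2), for illustration only."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


N = 64
x = [math.cos(2.0 * math.pi * 5 * n / N) for n in range(N)]
X = dft(x)

magnitude = [abs(c) for c in X]       # magnitude spectrum
phase = [cmath.phase(c) for c in X]   # phase spectrum
print(round(magnitude[5], 6), round(magnitude[59], 6))  # 32.0 32.0
```

A real cosine with an integer number of cycles concentrates its energy in two DFT coefficients, k = 5 and k = N - 5, each with magnitude N/2.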

The Discrete Fourier transform is a very useful tool, be-

cause it can reveal periodicities in the input data as well as

the relative intensities of these periodic components. An example of the magnitude and phase graphs of the 64-point

DFT for a sum of two pure sinusoids at discrete frequencies

14/2?

and

15/4?

is shown in Fig. (3). Each discrete value

of the DFT is usually called a DFT coefficient.

The DFT, however, suffers from three important drawbacks as a tool for spectral analysis: a) spectral leakage, which means the presence of energy in zones where the spectrum should be zero (this is clearly seen in Fig. (3): two pure frequencies are analyzed, yet many nonzero samples are obtained in the spectrum at other frequencies); b) the frequency response of the DFT coefficients is not constant with frequency (“picket-fence” effect); and c) the spectral resolution, or ability to separate spectral lines that are close in frequency, depends inversely upon the length of the sequence in the time domain. This means that the DFT cannot properly distinguish close spectral components in time signals of short duration. Multiplying the time signals by special weighting functions called windows, and controlling the signal length, can help to overcome these limitations to some extent.

Fig. (3). Example of DFT frequency spectrum (magnitude and

phase) for two sinusoids closely spaced in frequency. Frequency

axis is normalized to fs/N, where fs is the sampling frequency and N

the number of samples in the sequence (64 in this example).
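Spectral leakage and the effect of windowing can be reproduced with a few lines of code. The sketch below compares a sinusoid that falls exactly on a DFT bin with one that falls halfway between two bins, with and without a Hann window. The frequencies (bins 8 and 8.5 of a 64-point DFT), the choice of the Hann window and the helper `far_leakage` are illustrative assumptions, not the configuration of Fig. (3).

```python
import cmath
import math


def dft_magnitude(x):
    """Magnitude of the DFT of Eq. (6), evaluated directly."""
    N = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
        for k in range(N)
    ]


def far_leakage(mag, tone_bin, guard=4):
    """Largest magnitude more than `guard` bins from the tone, relative to the peak.

    Only the first half of the spectrum (non-negative frequencies) is inspected.
    """
    half = mag[: len(mag) // 2]
    peak = max(half)
    far = [m for k, m in enumerate(half) if abs(k - tone_bin) > guard]
    return max(far) / peak


N = 64
on_bin = [math.sin(2.0 * math.pi * 8.0 * n / N) for n in range(N)]
off_bin = [math.sin(2.0 * math.pi * 8.5 * n / N) for n in range(N)]
hann = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / N) for n in range(N)]
off_bin_windowed = [w * v for w, v in zip(hann, off_bin)]

leak_on = far_leakage(dft_magnitude(on_bin), 8.0)
leak_off = far_leakage(dft_magnitude(off_bin), 8.5)
leak_windowed = far_leakage(dft_magnitude(off_bin_windowed), 8.5)
print(leak_on, leak_off, leak_windowed)
```

In this sketch the on-bin sinusoid shows essentially no leakage, the off-bin sinusoid leaks into distant bins, and the Hann window reduces that distant leakage at the cost of a wider main lobe.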

Using the DFT for spectral analysis of random signals (or stochastic processes) requires certain considerations to obtain a statistically valid result.

For stationary random signals, a commonly employed procedure to obtain a power spectral density (PSD) function in the frequency domain is Welch’s modified periodogram method. The PSD function is obtained in this case by calculating the mean value of the squared magnitudes of the DFT coefficients at each frequency value, over adjacent and usually overlapping windowed signal segments. The measure obtained in this way is a consistent estimate of the power spectrum. A typical spectrum obtained by Welch’s method, for a pure sinusoid embedded in white Gaussian noise, is shown in Fig. (4). Notice the peak that corresponds to the sinusoid, whose magnitude is significantly greater than the noisy background.


Fig. (4). An example of PSD spectrum obtained through Welch’s

method, for a sinusoid embedded in white, Gaussian noise.
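The averaging of windowed, overlapping periodograms described for Welch's method can be sketched as follows. The segment length, the 50% overlap, the Hann window, the normalization by the window energy and the test signal are all illustrative assumptions; scaling conventions for the PSD differ between implementations.

```python
import cmath
import math
import random


def periodogram(segment, window):
    """Windowed periodogram of one segment, for bins 0..N/2."""
    N = len(segment)
    energy = sum(w * w for w in window)  # normalizes the window's power
    xw = [w * s for w, s in zip(window, segment)]
    return [
        abs(sum(xw[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))) ** 2
        / energy
        for k in range(N // 2 + 1)
    ]


def welch_psd(x, seg_len=64, overlap=32):
    """Average the periodograms of overlapping, Hann-windowed segments."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / seg_len) for n in range(seg_len)]
    step = seg_len - overlap
    segments = [x[s:s + seg_len] for s in range(0, len(x) - seg_len + 1, step)]
    pgs = [periodogram(seg, window) for seg in segments]
    return [sum(p[k] for p in pgs) / len(pgs) for k in range(seg_len // 2 + 1)]


# Sinusoid at 0.25 cycles/sample embedded in white Gaussian noise
rng = random.Random(0)
x = [math.sin(2.0 * math.pi * 0.25 * n) + rng.gauss(0.0, 1.0) for n in range(1024)]
psd = welch_psd(x)
peak_bin = max(range(len(psd)), key=psd.__getitem__)
print(peak_bin)  # bin 16 corresponds to 0.25 cycles/sample for 64-point segments
```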

In the case of non-stationary signals, the Short-Time Fourier Transform (STFT) is an algorithm frequently used for DFT-based spectral analysis. In the STFT, the time signal is divided into short segments (usually overlapping) and a DFT is calculated for each one of these segments. A three-dimensional graph called a spectrogram is obtained by plotting the squared magnitude of the DFT coefficients as a function of time and frequency. This squared magnitude is usually represented by the brightness of the graph, as shown in Fig. (5).

Fig. (5). Spectrogram of a harmonic signal whose frequency varies

linearly with time (“linear chirp”).

An important special case of the STFT is the Gabor

Transform, in which a Gaussian weighting window is applied to the analyzed time sequence. This choice of window provides a better simultaneous resolution in time and frequency.
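An STFT with a Gaussian window can be sketched as below. The segment length, hop size, window width `sigma` and the linear chirp used as test signal are illustrative assumptions. Each row of `spectrogram` holds the squared DFT magnitudes of one segment, and `ridge` records the bin with the largest value in each row.

```python
import cmath
import math


def gaussian_window(N, sigma):
    """Gaussian weights centered on the segment; sigma is given in samples."""
    center = (N - 1) / 2.0
    return [math.exp(-0.5 * ((n - center) / sigma) ** 2) for n in range(N)]


def stft_power(x, seg_len=64, hop=16, sigma=10.0):
    """Squared-magnitude STFT: one row per segment, bins 0..seg_len/2."""
    w = gaussian_window(seg_len, sigma)
    rows = []
    for start in range(0, len(x) - seg_len + 1, hop):
        seg = [w[n] * x[start + n] for n in range(seg_len)]
        rows.append([
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / seg_len)
                    for n in range(seg_len))) ** 2
            for k in range(seg_len // 2 + 1)
        ])
    return rows


# Linear chirp: frequency rises from f0 to f1 (cycles/sample) over L samples
L, f0, f1 = 512, 0.05, 0.35
x = [math.cos(2.0 * math.pi * (f0 * n + 0.5 * (f1 - f0) * n * n / L)) for n in range(L)]

spectrogram = stft_power(x)
ridge = [max(range(len(row)), key=row.__getitem__) for row in spectrogram]
print(ridge[0], ridge[len(ridge) // 2], ridge[-1])  # the peak bin moves upward in time
```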

C) Spectral Analysis Using Parametric Models

Parametric spectral analysis is a method that can be used in many cases with some advantages over the non-parametric methods. Its advantage lies in the fact that a parametric description of the second-order statistics of a random sequence can be obtained by assuming a certain generating model for it. A comprehensive analysis of such methods is given in Stoica and Moses [4].

Spectral analysis using parametric methods does not suffer from the limitations in spectral resolution that characterize the DFT-based methods, because it does not involve a windowing (segment selection) process.

The mathematical expression of the PSD function of a random sequence is described in this case in terms of the model parameters and the variance of a white (constant PSD) random noise process used as the input signal of the model. Consequently, the values to be computed in this method are the parameters of the model and the variance of the input process.

The general expression for the transfer function of the model in parametric spectral analysis is analogous to that of a digital filter as shown in equation (4), and is expressed as a ratio of polynomials in the complex variable z

$$H(z) = \frac{B(z)}{A(z)} = \frac{\sum_{k=0}^{q} b_k\, z^{-k}}{1 + \sum_{k=1}^{p} a_k\, z^{-k}} \tag{7}$$

to which corresponds the equation in finite differences

$$x[n] = -\sum_{k=1}^{p} a_k\, x[n-k] + \sum_{k=0}^{q} b_k\, w[n-k] \tag{8}$$

in which w[n] is the input sequence and the observed data

x[n] represent the model’s output. Equations (7) and (8) are

related through the Z transform operator shown in equation

(3). The PSD function is obtained from (7) using (5) to ob-

tain the model’s frequency response, and is given by

$$\Phi_{xx}(\omega) = \left|H(e^{i\omega})\right|^{2}\, \Phi_{ww}(\omega) \tag{9}$$

In equation (9) $H(e^{i\omega})$ is the frequency response of the model, while $\Phi_{ww}$ and $\Phi_{xx}$ are respectively the PSD functions of the corresponding input and output signals. For a white-noise input,

$$\Phi_{xx}(\omega) = \sigma_w^{2} \left|H(e^{i\omega})\right|^{2} \tag{10}$$

where $\sigma_w^{2}$ is the input noise variance.
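Once the model coefficients and the input noise variance are known, the PSD of equation (10) can be evaluated at any frequency. The sketch below assumes the coefficient convention of equation (7), with `a` holding a_1..a_p and `b` holding b_0..b_q; the second-order model used as an example is hypothetical.

```python
import cmath
import math


def arma_psd(b, a, noise_var, omega):
    """Eq. (10): noise_var * |H(exp(i*omega))|^2 with H(z) = B(z)/A(z) of Eq. (7).

    b = [b_0, ..., b_q]; a = [a_1, ..., a_p] (the leading 1 of A(z) is implicit).
    """
    z_inv = cmath.exp(-1j * omega)
    B = sum(bk * z_inv ** k for k, bk in enumerate(b))
    A = 1 + sum(ak * z_inv ** (k + 1) for k, ak in enumerate(a))
    return noise_var * abs(B / A) ** 2


# Hypothetical AR(2) model with poles at r*exp(+/- i*theta)
r, theta = 0.9, math.pi / 3
a = [-2.0 * r * math.cos(theta), r * r]

grid = [math.pi * i / 256 for i in range(257)]  # 0 <= omega <= pi
psd = [arma_psd([1.0], a, 1.0, w) for w in grid]
peak = grid[max(range(len(grid)), key=psd.__getitem__)]
print(peak)  # close to the pole angle theta
```

With no poles and a single coefficient b_0 = 1 the function returns the constant `noise_var`, which is the flat PSD of the white-noise input itself.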

According to the characteristics of the PSD of the analyzed random sequence, there are three types of parametric models:

• Autoregressive (AR) models, corresponding to the particular case $\{b_k\} = 0$ for k > 0, resulting in an all-pole transfer function.

• Moving average (MA) models, which correspond to $\{a_k\} = 0$, resulting in an all-zero transfer function.

• Autoregressive moving average (ARMA) models, the general case in which there are both poles and zeros in the model’s transfer function.
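The three model types differ only in which coefficients of the difference equation (8) are nonzero, so a single routine can generate sequences from any of them. The following sketch assumes Gaussian white noise as the input w[n] and zero initial conditions; the function name and the coefficient values are illustrative.

```python
import random


def simulate_arma(a, b, n_samples, noise_std=1.0, seed=0):
    """Generate x[n] from Eq. (8), driven by Gaussian white noise w[n].

    a = [a_1, ..., a_p], b = [b_0, ..., b_q]; returns (x, w).
    """
    rng = random.Random(seed)
    w = [rng.gauss(0.0, noise_std) for _ in range(n_samples)]
    x = []
    for n in range(n_samples):
        ar_part = sum(a[k - 1] * x[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        ma_part = sum(b[k] * w[n - k] for k in range(len(b)) if n - k >= 0)
        x.append(-ar_part + ma_part)
    return x, w


ar_x, ar_w = simulate_arma(a=[-0.9], b=[1.0], n_samples=500)           # AR(1): all-pole
ma_x, ma_w = simulate_arma(a=[], b=[1.0, 0.5], n_samples=500)          # MA(1): all-zero
arma_x, arma_w = simulate_arma(a=[-0.9], b=[1.0, 0.5], n_samples=500)  # ARMA(1,1)
```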


The three types of models are equivalent if the order is selected appropriately, i.e., a process which is inherently AR of a certain order can be described by an MA model of higher order. However, AR models are more widely used because of the relative simplicity of calculating the model’s parameters through the Yule-Walker equations. Fig. (6) shows the PSD curve for a typical AR spectrum.

Fig. (6). A typical PSD function obtained for an AR model, exhibit-

ing two peaks corresponding to two pairs of complex conjugate

poles in the model’s transfer function.
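The Yule-Walker equations relate the AR coefficients to the autocorrelation of the observed sequence. A common way to solve them is the Levinson-Durbin recursion, sketched below with the sign convention of equation (8). The biased autocorrelation estimate, the function names and the simulated AR(2) test process are assumptions made for this illustration.

```python
import random


def autocorr(x, max_lag):
    """Biased sample autocorrelation of the mean-removed sequence."""
    N = len(x)
    mean = sum(x) / N
    xc = [v - mean for v in x]
    return [sum(xc[n] * xc[n + lag] for n in range(N - lag)) / N
            for lag in range(max_lag + 1)]


def yule_walker(x, order):
    """Estimate a_1..a_p and the noise variance via the Levinson-Durbin recursion.

    Convention of Eq. (8) with b_0 = 1: x[n] = -sum_k a_k x[n-k] + w[n].
    """
    r = autocorr(x, order)
    a = []
    err = r[0]  # prediction error power of the order-0 model
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - 1 - k] for k in range(m - 1))
        refl = -acc / err  # reflection coefficient of stage m
        a = [a[k] + refl * a[m - 2 - k] for k in range(m - 1)] + [refl]
        err *= 1.0 - refl * refl
    return a, err


# Simulated AR(2) process: x[n] = 1.5 x[n-1] - 0.7 x[n-2] + w[n]
rng = random.Random(1)
x = [0.0, 0.0]
for _ in range(4000):
    x.append(1.5 * x[-1] - 0.7 * x[-2] + rng.gauss(0.0, 1.0))

a_hat, var_hat = yule_walker(x, 2)
print(a_hat, var_hat)  # expected near [-1.5, 0.7] and 1.0
```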

D) Discrete Wavelet Transform

The Discrete Wavelet Transform (DWT) is a mathemati-

cal tool that can be used very effectively for non-stationary

signal analysis. There is extensive literature on the DWT; see for example Burrus et al. [5].

In DWT analysis, a signal x(t) can be described through a

linear decomposition as

$$x(t) = \sum_{j} \sum_{k} a_{j,k}\, \psi_{j,k}(t) \tag{11}$$

In this equation $j, k \in \mathbb{Z}$ are integer indexes, $a_{j,k}$ are the wavelet coefficients of the expansion, and $\psi_{j,k}$ is a set of wavelet functions in t. Notice that the wavelet coefficients $a_{j,k}$ constitute a discrete set, and that the coefficient values are calculated according to

$$a_{j,k} = \langle x(t), \psi_{j,k}(t) \rangle = \int_{-\infty}^{+\infty} x(t)\, \psi_{j,k}(t)\, dt \tag{12}$$

The DWT decomposes the signal x[n] into a set of orthonormal wavelets and their associated scaling functions $\varphi_{j,k}$, which together constitute a wavelet basis. These functions can belong to different wavelet families, expressed by the functions $\psi_{j,k}$, which can be generated by dilations and translations of a basic (“mother”) wavelet. These dilations and translations are discrete, and the indexes j and k are respectively related to these processes, which can be expressed as

$$\psi_{j,k}(t) = 2^{-j/2}\, \psi\!\left(2^{-j} t - k\right), \qquad j, k \in \mathbb{Z} \tag{13}$$

In Eq. (13) the functions $\psi_{j,k}$ are dilated in a dyadic form (in powers of two) when varying the values of the index j, and in an analogous way translated when varying the index k. In this process, translation is associated with time resolution, and dilation provides scaling, a concept closely related here to frequency resolution.
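Dyadic dilation and translation, and the computation of wavelet coefficients as inner products, can be illustrated with the Haar wavelet, which is simple enough to write in closed form. The sketch below assumes the dyadic convention ψ_{j,k}(t) = 2^(-j/2) ψ(2^(-j) t - k) for Eq. (13) and approximates the integral of Eq. (12) with a midpoint rule; the Haar mother wavelet, the integration interval and the test signal are illustrative choices.

```python
def haar_mother(t):
    """Haar mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere."""
    if 0.0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0


def psi(j, k, t):
    """Dilated and translated wavelet, assuming 2^(-j/2) * psi(2^(-j) t - k)."""
    return 2.0 ** (-j / 2.0) * haar_mother(2.0 ** (-j) * t - k)


def inner_product(f, g, t_min=0.0, t_max=8.0, steps=8192):
    """Midpoint-rule approximation of the integral of f(t) g(t) over [t_min, t_max]."""
    dt = (t_max - t_min) / steps
    return sum(
        f(t_min + (i + 0.5) * dt) * g(t_min + (i + 0.5) * dt) for i in range(steps)
    ) * dt


def wavelet_coefficient(x, j, k):
    """Eq. (12) for a real signal x(t) supported inside the integration interval."""
    return inner_product(x, lambda t: psi(j, k, t))


def signal(t):
    """Test signal built from two basis functions with known coefficients."""
    return 2.0 * psi(0, 0, t) + 1.0 * psi(1, 1, t)


print(wavelet_coefficient(signal, 0, 0))  # approximately 2
print(wavelet_coefficient(signal, 1, 1))  # approximately 1
print(wavelet_coefficient(signal, 1, 0))  # approximately 0
```

Because the Haar functions are orthonormal, each coefficient recovers exactly the weight with which the corresponding basis function enters the signal.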

Wavelet functions must satisfy the conditions

$$\lim_{t \to \infty} \psi_{j,k}(t) = 0 \tag{14}$$

and

$$\int_{-\infty}^{\infty} \psi_{j,k}(t)\, dt = 0. \tag{15}$$

In these conditions, (14) implies decay, and (15) implies oscillations like a wave function. Fig. (7) shows examples of wavelet functions that are well described in the literature.

Fig. (7). Examples of wavelets: (a) Daubechies Db10, (b) Coiflet

Coif5.

The DWT, for which an algorithm called the Fast Wavelet Transform (FWT) allows a very efficient calculation, currently plays a central role in many DSP applications. The result of the DWT is a multi-resolution decomposition, in which at each level the signal is decomposed into “approximation” and “detail” coefficients. This decomposition is realized through a process that is equivalent to lowpass and highpass filtering for the approximation and for the details respectively, using special digital filters called “Quadrature Mirror Filters” (QMF). There are two types of QMF filters: the lowpass scaling filter h, and the highpass wavelet filter g. The g filter is equivalent to the h filter reversed in time and with the signs of its coefficients alternated. DWT decompositions can be depicted by a tree structure as shown in Fig. (8), where approximation and detail coefficients are represented. Each one of the J decomposition levels corresponds to a certain dilation j, whereas the index k determines the corresponding translations. The DWT can also be extended to non-orthogonal decompositions.
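The filter-bank view of the DWT can be sketched with the two-tap Haar filters, for which the highpass filter g is obtained from the lowpass filter h by time reversal and sign alternation. The code below is a minimal illustration restricted to Haar filters and to signal lengths divisible by 2^J; the function names are illustrative, and sign and ordering conventions vary between implementations.

```python
import math

h = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]            # lowpass scaling filter
g = [(-1) ** n * h[len(h) - 1 - n] for n in range(len(h))]  # h reversed, signs alternated


def dwt_level(x):
    """One analysis level: filter with h and g, keep one output per pair of samples."""
    approx = [h[0] * x[2 * i] + h[1] * x[2 * i + 1] for i in range(len(x) // 2)]
    detail = [g[0] * x[2 * i] + g[1] * x[2 * i + 1] for i in range(len(x) // 2)]
    return approx, detail


def idwt_level(approx, detail):
    """One synthesis level: rebuild each pair of samples from (approx, detail)."""
    x = []
    for a, d in zip(approx, detail):
        x.append(h[0] * a + g[0] * d)
        x.append(h[1] * a + g[1] * d)
    return x


def wavedec(x, levels):
    """Tree decomposition; returns [A_J, D_J, ..., D_1]."""
    approx, details = list(x), []
    for _ in range(levels):
        approx, d = dwt_level(approx)
        details.append(d)
    return [approx] + details[::-1]


def waverec(coeffs):
    """Invert wavedec by applying the synthesis step from the coarsest level up."""
    approx = coeffs[0]
    for d in coeffs[1:]:
        approx = idwt_level(approx, d)
    return approx


x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
coeffs = wavedec(x, 3)
reconstructed = waverec(coeffs)
print([len(c) for c in coeffs])  # [1, 1, 2, 4]
```

Since the Haar basis is orthonormal, the decomposition preserves the signal energy and the original samples are recovered by the synthesis step.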