# REBMEC: Repeat Based Maximum Entropy Classifier for Biological Sequences

**ABSTRACT** An important problem in biological data analysis is to predict the family of a newly discovered sequence, such as a protein or DNA sequence, using the collection of available sequences. In this paper we tackle this problem and present REBMEC, a Repeat Based Maximum Entropy Classifier of biological sequences. Maximum entropy models are known to be theoretically robust and to yield high accuracy, but are slow; this makes them useful as benchmarks to evaluate other classifiers. Specifically, REBMEC is based on the classical Generalized Iterative Scaling (GIS) algorithm and incorporates repeated occurrences of subsequences within each sequence. REBMEC uses maximal frequent subsequences as features but can support other types of features as well. Our extensive experiments on two collections of protein families show that REBMEC performs as well as existing state-of-the-art probabilistic classifiers for biological sequences without using domain-specific background knowledge such as multiple alignment, data transformation and complex feature extraction methods. The design of REBMEC is based on generic ideas that can apply to other domains where data is organized as collections of sequences.



Pratibha Rani
Center for Data Engineering
International Institute of Information Technology
Hyderabad, India
pratibha rani@research.iiit.ac.in

Vikram Pudi
Center for Data Engineering
International Institute of Information Technology
Hyderabad, India
vikram@iiit.ac.in


## 1 Introduction

A critical problem in biological data analysis is to classify bio-sequences based on their important features and functions. This problem is important due to the exponential growth and accumulation of newly generated sequence data in recent years [36], which demands automatic methods for sequence classification. Predicting the family of an unclassified sequence reduces the time and cost required for performing laboratory experiments to determine its properties, such as function and structure, because sequences belonging to the same family have similar characteristics.

International Conference on Management of Data, COMAD 2008, Mumbai, India, December 17–19, 2008.
© Computer Society of India, 2008

The known state-of-the-art solutions for this problem mainly use approaches such as Sequence Alignment [2, 22, 29], Hidden Markov Models (HMM) [11, 19], Probabilistic Suffix Trees (PST) [5, 12], and Support Vector Machines (SVM) [6, 21]. Recent approaches [24, 26] have tried to improve SVM by incorporating domain knowledge, using complex features based on structures, and combining it with other classifiers.

In this paper, we propose a data mining based, simple but effective solution that does not require domain knowledge: the Repeat Based Maximum Entropy Classifier (REBMEC), a new maximum entropy based classifier that is able to incorporate repeats of subsequences within a sequence. REBMEC uses a novel framework based on the classical Generalized Iterative Scaling (GIS) [10] algorithm to find the maximum entropy model for a given collection of biological sequences.

Unlike other Bayesian classifiers such as Naive Bayes, maximum entropy based classifiers do not assume independence among the features. These classifiers build the model of the dataset using an iterative approach to find the parameter values that satisfy the constraints generated by the features and the training data [35, 32, 9]. The maximum entropy principle has been widely used for the discretization of numeric feature values [18], feature selection [34, 33], and various text related tasks such as translation [7], document classification [27], and part-of-speech tagging [33]. Our approach is inspired by these works, because comparisons between biological sequence data and natural languages are commonplace [9]. REBMEC has the following desirable features:

1. It uses a novel framework based on GIS to find the posterior probabilities, using maximal frequent subsequences as features. In doing so, it adapts GIS to deal with a large feature set.

2. It incorporates repeated occurrences of subsequences within each sequence of a family to compute feature probabilities.

3. It handles the problem that arises in Bayesian classifiers due to a nonuniform feature set (where not all features are present in each class).

4. It uses an entropy based feature selection method to find the discriminating features for a class and to reduce the number of irrelevant features.

5. It is scalable with the database size, as it does not compare the query sequences with the whole database.

6. It does not require domain knowledge based ideas such as the use of alignment based similarity (as in FASTA, BLAST and PSI-BLAST), complex feature extraction, or data transformation (as in SVM).

The remainder of this paper is organized as follows: Section 2 formally introduces the problem, and Section 3 introduces maximum entropy models. Section 4 provides critical definitions used in the paper and describes the methods for estimating feature probabilities. Section 5 presents the problem of Bayesian classifiers that arises when all features are not present in all classes. Section 6 describes the overall design of the proposed REBMEC classifier. Section 7 provides the dataset and implementation details. Section 8 describes all the experiments and results. Section 9 discusses these results. Section 10 presents the related work, and Section 11 concludes.

## 2 Problem Definition

Given a training dataset D = {F1, F2, ..., Fn} as a set of n families, where each family¹ is a collection of sequences, the goal of the classifier is to label a query sequence S with the family Fi for which the posterior probability P(Fi|S) is maximum. Bayes formula allows us to compute this probability from the prior probability P(Fi) and the class-conditional probability P(S|Fi) as follows:

$$P(F_i|S) = \frac{P(S|F_i)\,P(F_i)}{P(S)} \qquad (1)$$

Since the evidence P(S) is common to all families, it is ignored. P(Fi) is the relative frequency of family Fi in D, computed as P(Fi) = Ni/N, where Ni is the number of sequences in family Fi and N is the total number of sequences in the dataset. Hence the classification problem reduces to the correct estimation of P(S|Fi), given the dataset D.
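The classification rule above can be sketched in a few lines. In this sketch, `loglik` is a stand-in for the log class-conditional probabilities log P(S|Fi) (however they are estimated), and `family_sizes` holds the counts Ni; the function and variable names are illustrative, not from the paper:

```python
import math

def classify_bayes(loglik, family_sizes):
    """Pick the family maximizing P(Fi|S) ∝ P(S|Fi) P(Fi).

    loglik: dict mapping family -> log P(S|Fi)
    family_sizes: dict mapping family -> Ni (training sequences in Fi)
    The evidence P(S) is ignored, as it is common to all families.
    """
    N = sum(family_sizes.values())  # total sequences in the dataset
    best, best_lp = None, -math.inf
    for fam, ll in loglik.items():
        lp = ll + math.log(family_sizes[fam] / N)  # log P(S|Fi) + log P(Fi)
        if lp > best_lp:
            best, best_lp = fam, lp
    return best
```

Working in log space here also anticipates the numerical issue discussed in Section 5.2.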

## 3 Maximum Entropy Models

The maximum entropy principle is well accepted in the statistics community. It states that, given a collection of known facts about a probability distribution, one should choose a model for this distribution that is consistent with all the facts but otherwise as uniform as possible. Hence, the chosen model does not assume any independence between its parameters that is not reflected in the given facts. It can be shown that there is always a unique maximum entropy model that satisfies the constraints imposed by the training set and the choice of features, and this model has the exponential form [32]:

$$p(x) = \pi \prod_{i=1}^{n} \mu_i^{f_i(x)} \qquad (2)$$

where f_i(x) = 1 if f_i ⊆ x and 0 otherwise, and π is a normalization constant that guarantees Σ_x p(x) = 1. Each parameter µ_i can be viewed as the weight of the feature f_i in the model.

In a maximum entropy based classifier, the estimation of class-conditional probabilities is done without assuming independence among the features. Once features have been selected, the task of modeling is reduced to finding the parameter values that satisfy the constraints generated by the selected features and the training data. The parameter values cannot be obtained directly and must be estimated by an iterative, hill-climbing method. Two such methods are available: the classical Generalized Iterative Scaling (GIS) [10] and Improved Iterative Scaling (IIS) [30].

¹ In this paper, the terms "family" and "class" are used interchangeably.

### 3.1 Parameter Estimation

Both the GIS and IIS algorithms require a "constraint set" CS on the data distribution X of the domain. This constraint set is generated by the selected features. Both algorithms begin with all parameters initialized to one, unless a previous solution for a similar set of features is available as a starting point. Both algorithms then update the parameters iteratively, each iteration resulting in a new probability distribution that is closer to satisfying the constraints imposed by the expectation values of the selected features. GIS requires that for any event the sum of the features be a constant. This requirement can be met for any set of features by introducing a "correction" feature f_l, where l = |CS| + 1, and adding it to the constraint set [32] such that:

$$f_l(x) = C - \sum_{j=1}^{|CS|} f_j(x) \qquad (3)$$

The constant C is chosen to be equal to the maximum value of the sum of features 1 through |CS|, i.e., the size of the feature set available for the dataset [32, 23]. IIS differs from GIS in that it does not require model features to sum to a constant. However, a study of the performance of iterative methods [23] shows that although IIS converges faster than GIS, the additional bookkeeping overhead required by IIS more than cancels any improvement in speed offered by the accelerated convergence. Since both methods result in equivalent models, we chose to build our classification framework on GIS.

### 3.1.1 The GIS Algorithm

The GIS algorithm first initializes a parameter µ_j to 1 for each constraint and executes the following procedure until convergence:

$$\mu_j^{(n+1)} = \mu_j^{(n)} \left( \frac{P(f_j)}{P^{(n)}(f_j)} \right)^{\frac{1}{C}}$$

where

$$P^{(n)}(f_j) = \sum_{x \in X} p^{(n)}(x)\, f_j(x)$$

$$p^{(n)}(x) = \pi \prod_{j=1}^{l} \left( \mu_j^{(n)} \right)^{f_j(x)}$$

C is some constant, and f_j(x) = 1 if f_j ⊆ x, and 0 otherwise.

The variable n in the above system of equations denotes the iteration number. P^(n)(f_j) is the expected support of f_j in the nth iteration, while P(f_j) is the actual support of f_j calculated from the training dataset. Convergence is achieved when the expected and actual supports of every f_j are nearly equal. π is a normalization constant which ensures that Σ_{x∈X} p^(n)(x) = 1.
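To make the update rule concrete, here is a minimal GIS sketch over a fully enumerable domain of binary feature vectors. The toy dataset, the binary features, and the iteration count are invented for illustration; the appended correction feature makes every event's feature sum equal the constant C, as required above:

```python
from itertools import product
from math import prod

def gis_fit(samples, n_feats, iters=500):
    """Fit a maximum entropy model p(x) = pi * prod_j mu_j^{f_j(x)} with GIS."""
    C = n_feats                                  # constant feature sum after correction
    domain = list(product([0, 1], repeat=n_feats))
    feats = lambda x: list(x) + [C - sum(x)]     # correction feature appended
    m = n_feats + 1
    # Actual supports P(f_j) from the training samples.
    emp = [sum(feats(x)[j] for x in samples) / len(samples) for j in range(m)]
    mu = [1.0] * m                               # all parameters start at one
    for _ in range(iters):
        # Model distribution p^(n)(x) under the current parameters.
        w = {x: prod(mu[j] ** feats(x)[j] for j in range(m)) for x in domain}
        Z = sum(w.values())
        p = {x: wx / Z for x, wx in w.items()}
        # Expected supports P^(n)(f_j), then the GIS multiplicative update.
        expd = [sum(p[x] * feats(x)[j] for x in domain) for j in range(m)]
        mu = [mu[j] * (emp[j] / expd[j]) ** (1.0 / C) for j in range(m)]
    return p, emp
```

After fitting, the model's expected supports match the empirical supports, which is exactly the convergence criterion stated above.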

## 4 Estimating Feature Probabilities

In this section, we describe how feature probabilities may be estimated from the training data. We begin with a few definitions:

**Definition 1.** The *Sequence count* of a feature Xj in family Fi is the number of sequences of family Fi in which feature Xj is present at least once.

**Definition 2.** The *Repeat count* of a feature Xj in family Fi is the sum of the number of occurrences of that feature in each sequence of the family.

**Definition 3.** A feature Xj is *frequent* in family Fi iff

Sequence count of Xj in Fi ≥ σ

where σ is the Minsup count for family Fi, and is calculated from the user-given support threshold minsup and Ni (the total number of sequences in family Fi) as:

σ = Ni × minsup

**Definition 4.** Let Γ = {X1, X2, ..., X|Γ|} be the set of all frequent features extracted from family Fi. A feature Xj ∈ Γ is *maximal frequent* in family Fi iff there is no Xk ∈ Γ such that Xk ⊃ Xj.

Either the Sequence count or the Repeat count may be used to estimate the probability P(Xj|Fi) of a feature Xj in a family Fi.

Using the Sequence count is simple: (Sequence count of Xj in Fi) / Ni is a good estimate of P(Xj|Fi), where Ni is the number of sequences in Fi. Though this is simple and efficient, it does not account for multiple occurrences of Xj in a sequence.

The alternative is to use the Repeat count, which uses all the occurrences of a feature. We use the following method, proposed in [31], to find P(Xj|Fi) using the Repeat count:

1. Find the number of slots available for Xj in family Fi (containing Ni sequences).

   - If we consider that the features may overlap:

     $$slots_{ij} = \sum_{k=1}^{N_i} \left( \text{length of } S_k - \text{length of } X_j + 1 \right) \qquad (4)$$

   - If we consider non-overlapping features:

     $$slots_{ij} = \sum_{k=1}^{N_i} \left\lfloor \frac{\text{length of } S_k}{\text{length of } X_j} \right\rfloor \qquad (5)$$

2. Find the probability of feature Xj in family Fi as:

   $$P(X_j|F_i) = \frac{\text{Repeat count of } X_j \text{ in } F_i}{slots_{ij}} \qquad (6)$$

Equations 4 and 5 find the total slots for feature Xj in family Fi by summing the available slots in each sequence Sk of Fi. Next, the feature probability is estimated as the fraction of the available slots that Xj actually occupies.
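Assuming that a feature is a contiguous substring and that occurrences are counted at every start position (the paper's exact occurrence convention is not spelled out here), the counts and estimates above can be sketched as:

```python
def sequence_count(X, family):
    """Number of sequences in the family containing X at least once (Def. 1)."""
    return sum(1 for s in family if X in s)

def repeat_count(X, family):
    """Total occurrences of X over all sequences, overlaps allowed (Def. 2)."""
    return sum(sum(1 for p in range(len(s) - len(X) + 1) if s[p:p + len(X)] == X)
               for s in family)

def slots(X, family, overlap=True):
    """Available slots for X in the family (Equations 4 and 5)."""
    if overlap:
        return sum(len(s) - len(X) + 1 for s in family)
    return sum(len(s) // len(X) for s in family)

def feature_prob(X, family, overlap=True):
    """P(X|Fi) = Repeat count of X in Fi / slots_ij (Equation 6)."""
    return repeat_count(X, family) / slots(X, family, overlap)
```

For instance, for the toy family ["ABAB", "ABBA"], the feature "AB" occurs three times in six overlapping slots, giving P("AB"|Fi) = 0.5, whereas the Sequence count based estimate would be 2/2 = 1.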

## 5 Problems with Bayesian Classifiers

Bayesian classifiers like Naive Bayes (NB) represent the query sequence S as a feature vector X = {X1, X2, ..., Xm} and use the feature probabilities P(Xj|Fi) to estimate the class-conditional probability P(S|Fi). The class-conditional probability is used to compute the posterior probability of each family as:

$$P(F_i|S) \propto P(F_i)\,P(S|F_i) \qquad (7)$$

### 5.1 Problem 1: Features Not Represented in the Training Data

Since the calculation of P(Xj|Fi) is based on the presence of Xj in the training data of class Fi, a problem can arise if Xj is completely absent from the training data of class Fi. The absence of Xj is quite common because the training data is typically too small to be comprehensive, and not because P(Xj|Fi) is really zero. This problem is compounded by the resulting zero probability for any sequence S that contains Xj, even when evidence based on other subsequences of S points to a significant presence of S in Fi. Due to this problem, the Bayesian formulation of Equation 7 cannot be applied directly to biological sequences when frequent subsequences are used as features. Known solutions to this problem are:

1. Use a nonuniform feature vector, i.e., use a different feature vector of the query sequence S for each class, which includes only those features of S that are present in that class. Then set P(S|Fi) = 0 only if none of the features of S is present in class Fi.

   This solution has a drawback: classes with more matching features of S can be assigned a lower posterior probability due to the multiplication of more feature probabilities, whose values are always less than one. This results in wrong classification and is illustrated in Example 1, shown in Figure 1.

   Example 1. Suppose C1 and C2 are two classes with 10 samples each, so that the prior probabilities of the classes are P(C1) = P(C2) = 1/2. A query sample S with feature vector {X1, X2, X3, X4} has two matching features in class C1, with probabilities P(X1|C1) = 1/10 and P(X3|C1) = 3/10, and four matching features in class C2, with probabilities P(X1|C2) = 1/10, P(X2|C2) = 2/10, P(X3|C2) = 3/10 and P(X4|C2) = 2/10. Using Equation 7, the posterior probabilities of the classes are obtained as P(C1|S) = 3/200 and P(C2|S) = 6/10000. Since P(C1|S) > P(C2|S), the query sample gets classified into class C1, although intuitively we know that class C2 is more suitable because it contains more matching features than class C1.

   Figure 1: An example showing the drawback of using a nonuniform feature vector.
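Example 1's numbers can be checked directly; this minimal sketch just evaluates Equation 7 with the nonuniform feature vectors of the example:

```python
from math import prod

prior = 0.5  # P(C1) = P(C2) = 1/2

# Matching-feature probabilities from Example 1.
c1_feats = [1 / 10, 3 / 10]                  # X1 and X3 in class C1
c2_feats = [1 / 10, 2 / 10, 3 / 10, 2 / 10]  # X1..X4 in class C2

p_c1 = prior * prod(c1_feats)  # = 3/200
p_c2 = prior * prod(c2_feats)  # = 6/10000

# Despite C2 matching more of S's features, its posterior is smaller,
# so the nonuniform feature vector misclassifies S into C1.
```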

2. Incorporate a small-sample correction into all probabilities, such as the Laplace correction [18]. The Laplace correction factor requires changing all the probability values, so it is not feasible for datasets with a large feature set, such as biological sequence datasets.

3. If a feature value does not occur in a given class, then set its probability to 1/N, where N is the number of examples in the training set [18].

The Simple Naive Bayes (Simple NB) classifier uses the Sequence count with solution (1) to obtain the model of the dataset. In REBMEC we use a different solution, proposed in [31] and described in Section 6.3, which outperformed the Simple NB classifier.

### 5.2 Problem 2: Out-of-Range Probability Values

Probability values obtained using equations such as Equations 6 and 8 are very small. When these very small values are multiplied to obtain the class-conditional probability used in Equation 7, the product can fall below the minimum representable number of the processor. This is a problem for all Bayesian classifiers that work with large feature sets and assume independence among the features, and hence directly multiply the feature probabilities to get the class-conditional probabilities. An appropriate scaling factor or a log-scaled formulation is used to avoid this problem.

We used a log-scaled formulation to avoid this problem in the Simple NB classifier. The proposed classification algorithm of REBMEC implicitly uses a log-scaled approach for finding class-conditional probabilities that can handle small probability values, and hence easily deals with this problem.
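The underflow, and the log-scaled fix, can be seen in a toy computation (the feature count and probability value are invented for illustration):

```python
import math

# Naive product of many tiny feature probabilities underflows to zero
# in double precision (1e-5 ** 1000 is far below the smallest double):
probs = [1e-5] * 1000
naive = math.prod(probs)

# The log-scaled formulation stays comfortably within range:
log_cc = sum(math.log(p) for p in probs)  # log class-conditional probability
```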

## 6 The REBMEC Classifier

Like all Bayesian classifiers, REBMEC works in two phases: a training phase and a classification phase. It also uses a feature selection phase, as part of the training phase, to select important features and prune irrelevant ones. So the REBMEC classifier runs in three phases:

1. Feature Extraction: This is the training phase, in which maximal frequent subsequences are first extracted as features from each family and stored with their Repeat and Sequence counts. Then, for each family, the Repeat and Sequence counts of the maximal features from other families, which are not maximal in this family, are also stored. This ensures that all families share the same feature set.

2. Feature Selection: The extracted feature set is pruned in this phase using an entropy based selection criterion. The result is a smaller set of features and their Repeat and Sequence counts within each family. The feature extraction and selection phases are executed only once, to train the classifier. After this, the original dataset is no longer required and the classifier works with the reduced feature set left after pruning.

3. Classification: This phase is executed to label a query sequence with the family having the highest posterior probability. The classifier first separates all the features belonging to the query sequence from the feature set available from the second phase, to make a query-based uniform feature set. It then uses this uniform feature set to find the posterior probability of each family and outputs the one with the highest posterior probability.

### 6.1 Feature Extraction

Any kind of feature can be supported by the classification algorithm of REBMEC, but we have used maximal frequent subsequences as features. The use of these simple features avoids the need for the complex data transformations and domain knowledge required by existing sophisticated feature mining algorithms [14, 20] for biological sequences. It was observed in [31] that frequent subsequences capture everything that is significant in a collection of sequences. But to keep the feature set small and to satisfy the following necessary criteria for the features of any classifier [20], using maximal frequent subsequences as features is a better option [31]:

1. Significant features: We ensure this by considering only frequent features (i.e., Sequence count ≥ Minsup count).

2. Non-redundant features: We ensure this by using maximal frequent subsequences.

3. Discriminative features: To ensure this, we use the entropy based selection criterion described in Section 6.2 after extraction of features.

We extracted maximal frequent subsequences as features using an Apriori-like method which uses the same user-given minimum support threshold minsup for all families. We extracted all possible features from a family by setting the maximum length of a feature to the length of the largest sequence in the training dataset.

Frequent subsequence extraction from a biological sequence dataset is a time consuming and memory intensive process. We optimized this process by avoiding the extraction of infrequent subsequences, storing information about their locations in a bit-vector. This bit-vector based optimization of frequent subsequence extraction, proposed in [31], is explained below.

The method is based on the idea that the presence of a '1' in a bit-vector indicates that a frequent subsequence of length l can be extracted from the corresponding position in the sequence. The presence of a '0' indicates that the subsequence of length l at the corresponding position in the sequence is infrequent. It follows that subsequences longer than l starting at this position will also be infrequent, hence the bit will remain '0'.

So, after initializing a bit-vector of '1's for each sequence in a family, of the same length as the sequence, the procedure starts by extracting frequent subsequences of length one and iteratively proceeds to longer subsequences. In the first phase of each iteration, candidate subsequences of length l are counted. In the second phase, the bit positions corresponding to infrequent subsequences of length l are set to '0', so that only the remaining positions are considered in the next iteration.
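The procedure above can be sketched as follows. This is an illustrative reading of the bit-vector optimization, assuming features are contiguous substrings and the support test uses the Sequence count of Definition 3; names are ours, not the paper's:

```python
def maximal_frequent_subsequences(seqs, sigma):
    # One bit per position of each sequence; '1' means a frequent subsequence
    # of the current length can still start here.
    bits = [[1] * len(s) for s in seqs]
    frequent = set()
    l = 1
    while True:
        # Phase 1: count candidates of length l (Sequence count, i.e. at
        # most one count per sequence), only at positions still marked '1'.
        counts = {}
        for s, b in zip(seqs, bits):
            seen = {s[p:p + l] for p in range(len(s) - l + 1) if b[p]}
            for sub in seen:
                counts[sub] = counts.get(sub, 0) + 1
        freq_l = {sub for sub, c in counts.items() if c >= sigma}
        # Phase 2: clear bits where the length-l substring is infrequent,
        # so longer substrings are never generated from those positions.
        for s, b in zip(seqs, bits):
            for p in range(len(s) - l + 1):
                if b[p] and s[p:p + l] not in freq_l:
                    b[p] = 0
        if not freq_l:
            break
        frequent |= freq_l
        l += 1
    # Keep only maximal frequent subsequences: those with no frequent
    # proper superstring (Definition 4).
    return {x for x in frequent
            if not any(x != y and x in y for y in frequent)}
```

For example, for the three sequences "ABCAB", "ABCAA", "ABCBB" with σ = 3, only "ABC" (and its substrings) is frequent in all three sequences, so "ABC" is the single maximal frequent feature.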

### 6.2 Entropy Based Selection of Discriminating Features

As is typical of frequent pattern mining, the feature extraction phase produces too many features and creates the problem of the curse of dimensionality. This problem worsens as minsup decreases, since the number of extracted features increases exponentially. We alleviate this problem by applying a feature selection phase that selects only discriminating features [20] for each class.

Our feature selection criterion is based on entropy. Entropy based criteria such as information gain and gain ratio have been widely used to select features for classifiers [18]. Since our aim is to find discriminating features for each family, we use low values of H(D|Xj = present), i.e., the entropy of the dataset in the presence of a feature, as the selection criterion:

$$H(D|X_j = present) = - \sum_{i=1}^{n} P(F_i|X_j = present) \log \left[ P(F_i|X_j = present) \right]$$

where

$$P(F_i|X_j = present) = \frac{\text{Sequence count of } X_j \text{ in } F_i}{\sum_k \text{Sequence count of } X_j \text{ in } F_k}$$

Analysis of this criterion gives us the following observations:

1. H(D|Xj = present) = 0 when a feature Xj is present in one and only one family.

2. H(D|Xj = present) is higher when a feature Xj is present in all families.

This criterion is the opposite of the information gain criterion, as it selects features with low entropy values, thereby selecting discriminating features. For selecting features we use a user-given threshold Hth, compare it with the calculated value of H(D|Xj = present), and select all features satisfying H(D|Xj = present) ≤ Hth while pruning the others.

Experimentally, we found that for very low minsup values, using the threshold Hth = 0 gives good results in the classification phase. But for other minsup values, good results are obtained by setting Hth to (1/2)H(D) or (1/3)H(D), where H(D) is the total entropy of the dataset, defined as:

$$H(D) = - \sum_i P(F_i) \log \left[ P(F_i) \right]$$

This happens because with Hth = 0, many important features get pruned. In our experiments, the above entropy based selection not only found discriminating features for all families, but also reduced the number of features by 36% for low minsup values, as shown in Table 1.

### 6.3 Classification Phase

REBMEC uses a very simple assumption to handle the problem of zero probabilities and the problem arising from the use of a nonuniform feature set (discussed in Section 5.1). It assumes that the probability of any feature being present in any family is never zero. So, for the features of other families which are not present in a given family, it uses a correction probability ε, which is the minimum possible probability computable using Repeat counts for a feature. It is obtained as:

$$\varepsilon = \frac{1}{\text{Sum of the lengths of the sequences of the largest family}} \qquad (8)$$

#### 6.3.1 Classification Algorithm of REBMEC

To classify a query sequence S, REBMEC finds the uniform feature set Γ, which is the set of features present in S, collected from all families. It uses Equation 6 to find the probabilities of features present in a family, and uses the correction probability ε as the probability of features not present in that family.

It then uses the features in set Γ to make the constraint set for each family. Each feature Xj, with its probability P(Xj|Fi) in family Fi, forms a constraint that needs to be satisfied by the statistical model for that particular family. Also, the "correction" feature fl, with expectation value E(fl) obtained using Equation 3, is added to this set. Note that, unlike the other feature probability values, this value ranges from 0 to C. Thus, for each family Fi, there is a constraint set CSi = {(Xj, P(Xj|Fi)) | Xj ∈ Γ} ∪ {(fl, E(fl))}. Since there could be multiple models satisfying these constraints, the proposed algorithm ComputeProb, like GIS, selects the one with maximum entropy and finds the parameters of that model. In doing so, it finds the class-conditional probability P(S|Fi) of that family.

REBMEC then finds the posterior probability of all families using the log-scaled version of Equation 7:

$$LP(F_i|S) = LP(S|F_i) + LP(F_i) \qquad (9)$$

where LP(Fi|S) = log[P(Fi|S)], LP(S|Fi) = log[P(S|Fi)], and LP(Fi) = log[P(Fi)].

Finally, it classifies the query sequence into the family with the largest posterior probability. The pseudo-code for the method discussed above is shown in Figure 2.

1. Find the features from all families which are present in the query sequence S and make the uniform feature set Γ = {X1, X2, ..., Xm}.
2. for each family Fi do
   (a) for each feature Xj ∈ Γ do
       if Xj ∈ Fi: compute P(Xj|Fi) using Equation 6
       else: set P(Xj|Fi) = ε
3. Run ComputeProb to obtain all LP(S|Fi)s.
4. Compute all LP(Fi|S)s using Equation 9.
5. Find the family Fk having the largest value of LP(Fi|S).
6. Classify S into Fk.

Figure 2: Classification algorithm of REBMEC.
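The driver in Figure 2 can be sketched as follows. Note the heavy caveats: `naive_compute_prob` is a deliberately simplified stand-in for the paper's ComputeProb (it assumes feature independence, which ComputeProb explicitly avoids), features are assumed to be contiguous substrings, and all names and the data layout are invented for illustration:

```python
import math

def classify(S, families, feature_probs, eps, priors, compute_prob):
    # feature_probs[f][X] = P(X|f) for features present in family f (Eq. 6);
    # eps is the correction probability of Equation 8.
    # Step 1: uniform feature set -- features of any family found in S.
    gamma = {X for f in families for X in feature_probs[f] if X in S}
    best, best_lp = None, -math.inf
    for f in families:
        # Step 2: per-family probabilities, with eps for absent features.
        probs = {X: feature_probs[f].get(X, eps) for X in gamma}
        # Steps 3-4: LP(S|Fi) via compute_prob, then Equation 9.
        lp = compute_prob(probs) + math.log(priors[f])
        if lp > best_lp:                 # Steps 5-6: keep the argmax family
            best, best_lp = f, lp
    return best

def naive_compute_prob(probs):
    # Simplified stand-in for ComputeProb: independence of features assumed.
    return sum(math.log(p) for p in probs.values())
```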

6.3.2

ComputeProb is a GIS based method which, unlike GIS,

computestheclass-conditionalprobabilitiesinsteadofstor-

ing the parameter values of each constraint.

shows pseudo-code of the ComputeProb algorithm and is

described below.

To make the large feature set manageable, it divides the

feature set ? into small sets using Hamdis as the similarity

measure to find the set of similar features. This similarity

measure is a simple modification of the Hamming Distance

to take into account features of different lengths and is de-

fined as:

The ComputeProb Algorithm

Figure 3

Hamdis(X1,X2) = No. of positions differing in symbols

+ |length(X1) − length(X2)|

The algorithm selects one feature from Ω and calculates Hamdis for all other features with respect to the selected feature. Then it groups the k features with least Hamdis, together with the selected feature, to make the small feature set Ωf. This process is repeated till there are fewer than k features left in Ω, which are grouped together along with the correction feature.
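A minimal sketch of the Hamdis measure and the greedy grouping step described above (function names are illustrative; the correction feature, which the paper appends to the leftover group, is omitted here):

```python
def hamdis(x1, x2):
    """Hamming distance extended to strings of different lengths:
    symbol mismatches over the common prefix plus the length difference."""
    diff = sum(1 for a, b in zip(x1, x2) if a != b)
    return diff + abs(len(x1) - len(x2))

def group_features(features, k):
    """Greedily split the feature set into small sets of k+1 similar
    features: pick a feature, attach its k nearest neighbours by Hamdis,
    and repeat until fewer than k+1 features remain (the leftovers form
    the last group)."""
    remaining = list(features)
    groups = []
    while len(remaining) >= k + 1:
        seed = remaining.pop(0)
        remaining.sort(key=lambda x: hamdis(seed, x))
        groups.append([seed] + remaining[:k])
        remaining = remaining[k:]
    if remaining:
        groups.append(remaining)
    return groups
```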

 1. Using Hamdis divide the feature set Ω into small sets as
    Ω' = {Ω1, Ω2, ..., ΩM}, such that ∪i Ωi = Ω and Ωi ∩ Ωj = φ
 2. for each family Fi:                        # Class-conditional probability
 3.   Initialize LP(S|Fi) = 0
 4.   for each small feature set Ωf ∈ Ω':
 5.     Set last = 2^|Ωf| − 1
 6.     for k = 0 to last:
 7.       Initialize LP(Tk) = log(1/(last + 1))
 8.     for each feature Xj ∈ Ωf:
 9.       Initialize µj = 0
10.     while all constraints are not satisfied
11.     {
12.       for each feature Xj ∈ Ωf:
13.         Initialize Sumj = 0
14.         for k = 0 to last:
15.           if Tk contains Xj:
16.             Update Sumj = Sumj + exp(LP(Tk))
17.         Set LP(Xj|Fi) = log(P(Xj|Fi))
18.         Update µj = µj + LP(Xj|Fi) − log(Sumj)
19.       for k = 0 to last:
20.         for each feature Xj ∈ Ωf:
21.           if Tk contains Xj:
22.             Update LP(Tk) = LP(Tk) + µj
23.       Initialize normSum = 0               # Normalization factor
24.       for k = 0 to last:
25.         Update normSum = normSum + exp(LP(Tk))
26.       Set µ0 = 1/normSum
27.       for k = 0 to last:
28.         Update LP(Tk) = LP(Tk) + log(µ0)
29.     }
30.     Update LP(S|Fi) = LP(S|Fi) + LP(Tlast)
31. Return LP(S|Fi)

Figure 3: The ComputeProb algorithm.

For computing the class-conditional probability of a family, ComputeProb finds LP(S|Fi) for each small set of features and later combines them by assuming independence among the sets. It uses bit-vectors Tk to represent the presence/absence of the |Ωf| features of the set Ωf. Specifically, in steps 6 to 9, it initializes the parameters µj and the probabilities LP(Tk). In steps 12 to 22, it updates the µj and LP(Tk) values using the probabilities of features obtained from the training data. In steps 23 to 28, it finds the normalization constant µ0 and applies it to the LP(Tk) values. Finally, in step 30, it updates the LP(S|Fi) value using the obtained value of LP(Tlast) for that feature set, where the bit-vector Tlast represents that all features of the set Ωf are present in the query sequence S. Note that the LP(S|Fi) values returned by this algorithm are in log scale. Also note that during implementation the Tk need not be stored but can be computed on the fly.
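Under the simplifying assumptions that one small feature set is processed in isolation and that the bit-vector Tk is just the integer k, the inner loop of Figure 3 can be sketched as below. The per-iteration increment is applied directly to the LP(Tk) values, in the undamped form (no constant C) discussed in Section 6.3.3; the function name and the fixed iteration count are illustrative.

```python
import math

def compute_prob_group(feature_logps, iters=50):
    """Sketch of the GIS loop of Figure 3 for one small feature set.

    feature_logps: list of log P(Xj|Fi) values for the features of the set.
    The bit-vector T_k is represented by the integer k: bit j set means
    feature Xj is present.  Returns LP(T_last), the log probability of the
    outcome in which every feature of the set is present.
    """
    m = len(feature_logps)
    last = (1 << m) - 1
    # Step 7: start from the uniform (maximum entropy) distribution
    lp = [math.log(1.0 / (last + 1))] * (last + 1)
    for _ in range(iters):  # fixed number of iterations, as in the paper
        for j in range(m):
            # Steps 13-16: current model marginal of feature Xj
            sum_j = sum(math.exp(lp[k]) for k in range(last + 1)
                        if (k >> j) & 1)
            # Step 18 (undamped): increment moving the marginal towards
            # the observed probability P(Xj|Fi)
            delta = feature_logps[j] - math.log(sum_j)
            # Steps 19-22: apply the increment to every Tk containing Xj
            for k in range(last + 1):
                if (k >> j) & 1:
                    lp[k] += delta
        # Steps 23-28: renormalize so the probabilities sum to one
        log_norm = math.log(sum(math.exp(x) for x in lp))
        lp = [x - log_norm for x in lp]
    return lp[last]  # Step 30: LP(T_last)
```

With only marginal constraints, the maximum entropy solution is the independent product of the feature probabilities, which gives a simple sanity check on convergence.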


6.3.3 Additional Issues

Calculation of the expectation E(fl) of the correction feature fl in a family Fi, using Equation 3, requires scanning all the sequences of that family. We observe that, using Equation 3, the minimum calculated value of E(fl) in a family will be (C − |Ω|), when all the features of the feature set Ω are present in all the sequences of that family, and the maximum calculated value of E(fl) in a family will be C, when none of the features of the feature set Ω are present in any sequence of that family. Based on these observations, we used the minimum expectation value (C − |Ω|) as the approximate expectation value of the correction feature fl in each family. This removed the need to scan all the sequences of a family in the classification phase to calculate the expectation of fl. In practice we found this approximation to be good.

In our experiments we observed that if the correction feature fl is not added to the constraint set with a proper expectation value, then the algorithm is not able to compute correct class-conditional probabilities; so using the correction feature properly is a very important part of the algorithm. We also observed that the correction feature can be added either to each small group of features Ωf with approximate expectation value (|Ω| − |Ωf|), or only to one group with value (C − |Ω|). Both methods give exactly the same results, which means that both methods produce the same effect on the parameter values. Since adding the correction feature to each group of features increases the overall running time, it is better to add it to only one group with the appropriate expectation value.

Like other Bayesian methods, GIS based methods also deal with very small parameter and probability values, so they also need to tackle the "out of range" parameter values discussed in Section 5.2. In the case of GIS based methods, this problem becomes even more serious due to the iterative nature of these methods. To deal with it, we have designed ComputeProb using log scale. In our experiments we observed that due to the large value of the constant C, the iteration process overshoots the convergence point; so to make the increment value smaller for each iteration, we have not used the constant C in the calculation of the increment value (in step 18). As discussed in [33] and observed in our experiments, the number of iterations required for the model to converge can be hard-coded and the algorithm can be made to stop once it reaches that many iterations, so steps 10 to 29 are iterated for a fixed number of times.

7 Performance Model

In this section, we describe the datasets and the necessary

implementation details. We have used two collections of

protein families to evaluate the performance of REBMEC.

The first collection is a large collection of 10922 G protein-

coupled receptors (GPCRs) taken from the March-2005

Release 9.0 of GPCRDB [15] (http://www.gpcr.org/7tm).

Since the GPCR families are divided in hierarchies of sub-

families, we have used the sequences of the highest level

of families called the super families. After the removal of

unclassified and redundant sequences, the dataset consists

of 8435 sequences arranged in 13 families. This collection

is a skewed dataset with the peculiar property that its largest family, called Class A, contains 6031 sequences, which is 71% of the total sequences of the dataset.

The second collection of 3146 proteins arranged in 26

families was obtained from the Feb-2008 Release 55.0 of

SWISS-PROT [8] using the list of SWISS-PROT protein

IDs obtained from Pfam [4] version 22.0. After the redundant sequences were removed from each family, the number of sequences was down to 2097.

has already been used in [5, 12, 13], so using it allows a

direct comparison with their methods. We should note that,

due to the constant refinement in the topology of the Pfam

database, there are significant differences in the families

common to the two collections.

All the classifiers were assessed based on the standard Accuracy measure, which gives the percentage of correctly classified test sequences:

Accuracy = (NumCorrect / NumTested) × 100%

All the experiments were performed on a 1.9 GHz AMD

Athlon 64 processor machine with 385 MB of main mem-

ory, running Linux Fedora Core 4. All the programs were

written in the Perl language.

For comparison with the Naive Bayes classifier, we

implemented the Simple NB classifier discussed in Sec-

tion 5. For comparison with the maximum entropy clas-

sifier, which uses Sequence count, we implemented the

Maxent-A model which exactly follows all the steps used

in REBMEC classifier and uses the same classification

algorithm but uses Sequence count to find the probabil-

ity of a feature. It uses the correction probability ? =

1

No. of sequences of the largest familyfor the features not present in

a family. In all experiments we used the same set of fea-

tures for these classifiers that we used for the REBMEC

classifier.

Based on our observation of convergence of the µ val-

ues, we fixed the number of iterations as 50 and the number

of features in each group as 10 for implementing the Com-

puteProb algorithm.

For finding the accuracy of the classifiers, we divided

the dataset into training and test sets in the ratio 9:1 with

stratification (to include representatives from all families in

the test set in their proper proportion), but to keep the test

set small we included at most 50 sequences from each fam-

ily. We used the same minsup for all families of a dataset to

extract the maximal frequent subsequences from the train-

ing set. We experimented with different minsup values and

found the best support value for each classifier according

to the accuracy achieved with features obtained from that

support value.
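The split procedure above can be sketched as follows (the helper name and the fixed random seed are illustrative):

```python
import random

def stratified_split(families, test_ratio=0.1, cap=50, seed=0):
    """Sketch of the evaluation split: take roughly 1/10 of each family
    as test sequences (stratification), capped at 50 per family; the
    remaining sequences of each family form the training set.

    families: dict mapping family name -> list of sequences.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for name, seqs in families.items():
        seqs = seqs[:]          # do not mutate the caller's lists
        rng.shuffle(seqs)
        n_test = min(cap, max(1, int(len(seqs) * test_ratio)))
        test[name] = seqs[:n_test]
        train[name] = seqs[n_test:]
    return train, test
```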

We did not put any restriction on the maximum length

of maximal frequent subsequence to be extracted. We set

it as the length of the largest sequence of the dataset. This


enabled us to extract all the possible maximal frequent sub-

sequences from the dataset. Our experiments showed that

the extraction process terminates long before reaching the

maximum length, because as the subsequence length in-

creases, its possibility of being frequent decreases. We

used maximal subsequences of length greater than two as

features, to avoid trivial frequent subsequences of lengths

one and two.
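Treating features as contiguous substrings, the level-wise extraction described above can be sketched as follows. This is a simplified illustration, not the mining algorithm actually used; minsup is taken here as a fraction of the sequences.

```python
def maximal_frequent_substrings(seqs, minsup, min_len=3):
    """Sketch: enumerate substrings occurring in at least minsup of the
    sequences, keep only maximal ones (not contained in a longer frequent
    substring), and drop trivial features of length one and two."""
    def support(s):
        return sum(1 for seq in seqs if s in seq)

    threshold = max(1, int(minsup * len(seqs)))
    # Level-wise growth: a substring can only be frequent if its prefix is,
    # so extraction stops long before the maximum possible length.
    frequent = set()
    current = {s[i:i + 1] for s in seqs for i in range(len(s))}
    length = 1
    while current:
        current = {c for c in current if support(c) >= threshold}
        frequent |= current
        # extend each frequent substring by one symbol on the right
        current = {seq[i:i + length + 1]
                   for seq in seqs for i in range(len(seq) - length)
                   if seq[i:i + length] in current}
        length += 1
    # keep maximal substrings of length greater than two
    return {f for f in frequent if len(f) >= min_len
            and not any(f != g and f in g for g in frequent)}
```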

8 Experimental Results

In this section we present the results. Table 1 gives the number of features used in the experiments for different minsup values. The Without Pruning column gives the number of features when the feature set is not pruned, and the With Pruning column gives the number of features remaining in the feature set after pruning on the basis of entropy. For 5% minsup we used the threshold Hth = 0.0, and for the other minsup values we used half the entropy of the dataset as the threshold to prune the features. In all the experiments we used the features remaining in the feature set after pruning.

Table 1: Number of features used in experiments with the Pfam dataset for different minsup values

minsup (%)   Without Pruning   With Pruning
5            24005             15381
10           10743             6740
30           2627              2285
50           1131              1100

Table 2: Classification results (Accuracy %) of Simple NB, Maxent-A and REBMEC for the collection of 211 proteins of 26 families from Pfam 22.0 for different minsup values

minsup (%)   Simple NB   Maxent-A   REBMEC
5            0.47        66.82      90.52
10           0.0         48.34      84.36
30           2.84        23.22      77.73
50           10.43       16.11      69.19

Table 2 summarizes the results of experiments done on the classifiers with different minsup values.

Tables 3 and 4 show the family-wise and average accuracies of the classifiers with the corresponding minimum support values. The Size column of Table 3 and the Non Redundant Size column of Table 4 give the number of sequences of each family after removing duplicated sequences from the family (before dividing it into training and test sets). In Table 4 we have included the published results of the C classifier of [13] and the results of the matching families of the PST classifier published in [5]. Please note that the test sets and evaluation methods used by them are different from ours, so the results should not be compared strictly. Due to changes in the naming convention of Pfam families, we were not able to find results for some families in the published results of the PST classifier, and hence we have not reported the average accuracy of the PST classifier.

9 Discussion

In this section we discuss the merits of REBMEC as seen in the experimental results.

From the results of Table 3, it is evident that the Simple NB classifier is biased towards the largest family and, in its presence, is unable to recognize sequences of other families. But REBMEC is able to break the biasing effect of large classes and so performs very well even on the skewed dataset. The results indicate that there are some families, like Class E and Ocular albinism proteins, for which all the classifiers perform the same; observation of the feature set shows that these families have many discriminative features. And there are some families, like Class A and Insect odorant receptors, which have very few discriminative features; only REBMEC gives acceptable performance for these families.

The results of Table 2 clearly show that REBMEC

performs much better than the simple NB classifier and

Maxent-A model for all minsup values. This indicates that

REBMEC can perform well on datasets with large feature

sets, whereas the other Bayesian classifiers can not.

As the minsup value decreases, the performance of REBMEC improves and is highest at 5% minsup. In the case of the Simple NB classifier, performance improves as minsup increases, because the number of features decreases. The Maxent-A classifier performs better than the Simple NB classifier and its performance increases as minsup decreases, but it never gives accuracy comparable to that of REBMEC.

Since the Maxent-A model also uses the same frame-

work used by REBMEC but with Sequence count, the per-

formance of REBMEC indicates that use of Repeat count

is able to model the families in a better way than Sequence

count. This confirms the observation made in [31]. Hence

use of Repeat count is recommended over Sequence count

to obtain a good classification model for sequence data like

biological sequences.
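The two counting schemes can be contrasted directly (a toy illustration; the normalization used to turn these counts into the probabilities of Equation 6 is not reproduced here):

```python
def sequence_count(feature, seqs):
    """Number of sequences containing the feature at least once."""
    return sum(1 for s in seqs if feature in s)

def repeat_count(feature, seqs):
    """Total number of (possibly overlapping) occurrences of the feature,
    counting repeated occurrences within each sequence."""
    total = 0
    for s in seqs:
        total += sum(1 for i in range(len(s) - len(feature) + 1)
                     if s[i:i + len(feature)] == feature)
    return total
```

For a family {"ABAB", "AB", "CD"} the feature "AB" has Sequence count 2 but Repeat count 3, since it occurs twice in the first sequence.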

Results presented in Table 4 on the Pfam dataset indicate that the performance of REBMEC is comparable to the performances of the C classifier of [13] and PST [5]. It performs better than the C classifier for large families with more sequences. Clearly the PST classifier is the most consistent performer among the three; its performance is almost the same for all families.

Table 3: Family-wise and Average Classification results of Simple NB, Maxent-A and REBMEC for the collection of 265 GPCRs of 13 families from GPCRDB

Family Name                 Size    Simple NB (%)   Maxent-A (%)   REBMEC (%)
Class A                     6031    74.0             52.0           90.0
Class B                      309     0.0             96.7           96.7
Class C                      206     0.0            100.0          100.0
Class D                       65     0.0             66.7           50.0
Class E                       10     0.0            100.0          100.0
Ocular albinism proteins       8     0.0            100.0          100.0
Frizzled family              130     0.0             92.3           92.31
Insect odorant receptors     236     0.0              4.2           62.5
Plant Mlo receptors           52     0.0            100.0          100.0
Nematode chemoreceptors      755     0.0             44.0          100.0
Vomeronasal receptors        286    10.3             82.7           96.5
Taste receptors T2R          237     0.0             87.5           91.7
Class Z Archaeal opsins      110     9.0            100.0          100.0
Average Accuracy (%)                15.47            67.17          91.7
minsup (%)                          50                5             10

Table 4: Family-wise and Average Classification results of Simple NB, Maxent-A, REBMEC, Ferreira et al.'s C model and PST for the collection of 211 proteins of 26 families from Pfam 22.0

Family           Actual   Non Redundant   Simple NB   Maxent-A   REBMEC    C       PST
Name             Size     Size            (%)         (%)        (%)       (%)     (%)
7tm-1            64       64              0.0         33.3       100.0     100.0   93.0
7tm-2            32       32              0.0         66.7       66.7      100.0   94.4
7tm-3            29       29              0.0         66.7       66.7      100.0   83.3
AAA              229      210             28.57       76.2       100.0     97.7    87.9
ABC-tran         63       57              0.0         33.3       83.3      100.0   83.6
ATP-synt-ab      151      151             0.0         86.7       100.0     98.9    96.7
ATP-synt-A       30       30              0.0         33.3       66.7      89.3    92.4
ATP-synt-C       35       33              0.0         66.7       66.7      97.0    96.7
c2               394      286             0.0         55.2       82.76     91.3    92.3
CLP-protease     88       88              0.0         77.8       88.9      94.1    87.9
COesterase       126      126             0.0         100.0      100.0     87.3    91.7
cox1             23       23              0.0         100.0      100.0     90.9    83.8
cox2             32       32              0.0         100.0      100.0     100.0   98.2
Cys-knot         24       24              0.0         100.0      100.0     100.0   93.4
Cytochrom-B-C    9        9               0.0         0.0        100.0     100.0   79.2
Cytochrom-B-N    8        8               0.0         100.0      100.0     85.7    98.2
HCV-NS1          10       10              0.0         100.0      100.0     100.0   –
Oxidored-q1      33       33              0.0         100.0      100.0     93.5    97.1
PKinase          54       54              0.0         40.0       60.0      96.2    85.1
PPR              558      100             0.0         90.0       100.0     22.6    –
RuBisCO-large    16       16              0.0         100.0      100.0     81.3    98.7
rvt-1            156      156             0.0         100.0      100.0     79.6    88.4
RVT-thumb        41       41              0.0         100.0      100.0     91.2    –
TPR-1            562      274             55.17       24.1       93.1      54.9    –
zf-C2H2          196      101             0.0         60.0       70.0      88.4    92.3
zf-CCHC          183      110             0.0         63.6       81.82     100.0   88.6
Accuracy (%)                              10.43       66.82      90.52     90.0    –
minsup (%)                                50          5          5         –       –


The feature extraction process of each family is fine tuned in [13] by using different minsup values for different families, and a sophisticated feature extraction algorithm [14] is used. We note that REBMEC attains comparable performance without using such domain-specific ideas, and incorporating them could further increase the performance of REBMEC.

The families of the Pfam dataset were constructed on

the basis of sequence similarity using HMM [4], while the

families of GPCRDB were constructed manually on the ba-

sis of the function of proteins [15]; so similarity among se-

quences of a family of the Pfam dataset is higher than the

GPCRDB dataset. Due to this, classifiers like PSTs which

use Variable Length Markov Model perform better on the

Pfam dataset. Comparing the results of Tables 3 and 4 shows that although the GPCRDB dataset is larger and more skewed than the Pfam dataset, REBMEC performs better on the GPCRDB dataset than on the Pfam dataset. So REBMEC can be better than other sequence classifiers for large and skewed datasets with less similarity among sequences.

Note that we have used a very naive approach to find

similarity between features because we wanted to keep the

algorithm simple. The performance of REBMEC will in-

crease if more sophisticated similarity methods based on

domain knowledge are used to choose features of a group.

We also experimented with grouping features randomly

and found that results are the same. This indicates that the

features are almost independent of each other.

The Bayesian classifier of [3] requires the feature length to be supplied by the user. Other classifiers like PSTs, SMTs and the C classifier also require parameters other than minsup to be supplied by the user for the feature extraction process. It is observed that the performance of these classifiers is very sensitive to these parameters. REBMEC does not require the feature length or any parameter other than minsup for the feature extraction process, and the parameter required for pruning can be set easily according to the entropy of the dataset, as described in Section 6.2.

Scalability and Computational Requirements: The classification time of REBMEC and Maxent-A is much

larger than the other classifiers. For example, REBMEC

takes more than an hour to classify each test sequence

whereas the other classifiers take only a fraction of a second or at most a few minutes. The reason for this is the

iterative nature of the GIS-based procedure. However, the

advantage of REBMEC is that it is a theoretically robust

method that makes no assumptions about the underlying

data distribution. Hence REBMEC can be used as a good

benchmark to compare other classifiers. The memory con-

sumption of REBMEC in the classification phase is also

quite low because it does not store the bit-vectors Tk, but instead computes them on the fly (see Section 6.3.2). This is

unlike the HMMs, PSTs and SVM based classifiers which

require large memory.

REBMEC has the advantage of very low computational

requirements for the training phase as this phase involves

only finding counts of features in each family. HMMs,

PSTs, SMTs and SVM based classifiers have high memory requirements for the training phase and become com-

putationally very expensive when the number of classes in-

creases [5, 12]. Another point worth mentioning is that

SVM based classifiers require carefully selected positive

and negative samples for each class in the training phase

while REBMEC uses only positive samples.

10 Related Work

The principle of maximum entropy has been widely used

for a variety of natural language processing tasks like

translation [7], document classification [27], identifying

sentence boundaries [33], prepositional phrase attachment [33], part-of-speech tagging [33], and ambiguity resolution [33]. The authors of [25] study the multinomial models for document classification, which capture the word fre-

quency information in each document, and show that these

models perform better than the multi-variate Bernoulli

model. Use of Repeat count in REBMEC is similar to the

multinomial models but it follows a very different approach

for finding the feature probabilities which are used to build

the model. Basically, the multinomial models represent

training samples as bag of words whereas in REBMEC we

model them as sequences.

The authors of [35] present the ACME classifier, which

uses maximum entropy modeling for categorical data clas-

sification with mined frequent itemsets as constraints. The

authors also propose the approach of dividing the feature

set into independent clusters of features for reducing the

computational cost.

A sequence modeling method using mixtures of condi-

tional maximum entropy distributions is presented in [28].

This method generalizes the mixture of first order Markov

models by including the long term dependencies, known as

triggers, in the model. The authors of [9] use maximum entropy for modeling protein sequences, using unigram, bigram, unigram

cache and class based self triggers as features. Both [28, 9]

use the history of symbols to predict the next symbol of a

sequence.

Examples of Bayesian Sequence Classifiers can be

found in [3, 13, 16, 17]. The authors of [3] propose that the

NB classifier can be used for protein classification by rep-

resenting protein sequences as class conditional probability

distribution of k-grams (short subsequences of amino acids

of length k). The length of k-grams used in this method is

a user supplied parameter and the performance of the clas-

sifier is very sensitive towards this parameter.

The approach used in [13] is to use the unlabeled se-

quence to find rigid gap frequent subsequences of a cer-

tain minimum length and use these to obtain two features

which are combined in the NB classifier. This approach

uses a computationally expensive subsequence extraction

method [14] which is guided by many user supplied pa-

rameters. This method defers the feature extraction phase

till classification time which makes it computationally ex-

pensive for large datasets.

The authors of [17] introduce a novel word taxonomy based NB learner (WTNBL-MN) for text and sequences.


This classifier requires a similarity measure for words to build the word taxonomy. The authors of [16] present a promising recursive NB classifier, RNBL-MN, which constructs a tree of Naive Bayes classifiers for sequence classification, where each individual NB classifier in the tree is based on a multinomial event model (one for each class at each node in the tree).

Probabilistic Suffix Tree based Sequence Classi-

fiers [5, 12] predict the next symbol in a sequence based

on the previous symbols. Basically a PST [5] is a variable

length Markov Model, where the probability of a symbol

in a sequence depends on the previous symbols. The con-

ditional probabilities of the symbols used in PSTs rely on

exact subsequence matches, which becomes a limitation,

since substitutions of symbols by equivalent ones are often

very frequent in proteins. The proposed classifier of [12]

tries to overcome this limitation by generalizing PSTs to

SMTs with wild-card support, which is a symbol that de-

notes a gap of size one and matches any symbol on the al-

phabet. An experimental evaluation in [5] shows that PSTs

perform much better than a typical PSI-BLAST search and

as well as HMM. As bio-sequence databases are becoming

larger and larger, data driven learning algorithms for PSTs

or SMTs will require vast amounts of memory.

HMM based Sequence Classifiers [11, 19] use HMM

to build a model for each protein family based on multiple

alignment of sequences. HMM models suffer from known

learnability hardness results [1], exponential growth in

number of states and in practice, require a high quality mul-

tiple alignment of the input sequences to obtain a reliable

model. The HMM based classifiers use algorithms which

are very complex to implement and the models they gener-

ate tend to be space inefficient and require large memory.

Similarity based Sequence Classifiers [2, 22, 29] com-

pare an unlabeled sequence with all the sequences of the

database and assess sequence similarity using sequence

alignment methods like FASTA [29], BLAST [22] or PSI-

BLAST [2] and use K nearest neighbor approach to clas-

sify the sequence. But as the number of sequences in bio-

databases is increasing exponentially, this method is infea-

sible due to the increased time required to align the new

sequence with the whole database.

SVM based Sequence Classifiers [6, 21, 24, 26] either

use a set of features of protein families to train SVM or

use string kernel based SVMs, alone or with some stan-

dard similarity measure like BLAST or PSI-BLAST or

with some structural information. These classifiers require

a lot of data transformation but report the best accura-

cies. Since SVM is basically a binary classifier, to handle

a large number of classes, it uses the one against the rest

method, which becomes computationally very expensive as

the number of classes increases.

11 Conclusions and Future Work

In this paper, we proposed a new maximum entropy based classifier, REBMEC, which uses a novel GIS based framework to build the classification model and incorporates the Repeat count of features present in the query sequence. It

adapts GIS to deal with a large feature set. It uses the sim-

plest possible features of sequence families in the form of

maximal frequent subsequences and handles the problem

associated with Bayesian classifiers. We also discussed the

issues of inclusion of the correction feature and calculation

of increment values, which need to be handled properly in

GIS-based methods.

By experiments on two datasets, we demonstrated that

REBMEC’s performance is comparable to the PST and the

C classifier, and is superior to the other Bayesian classifiers

on large and skewed datasets. REBMEC has the advantage

of very low computational requirements as it does not use

any complex feature extraction method, domain knowledge

ordatatransformation. Itdoesnotrequireanyusersupplied

parameters which can affect the performance other than

minsup and the pruning threshold. We demonstrated that

the use of Repeat count gives a better model of sequence

families than Sequence count.

Future work includes (1) finding good similarity mea-

sures and methods to divide features in small sets with

minimum or no dependence among the features of differ-

ent sets, (2) evaluating REBMEC’s performance with so-

phisticated feature extraction methods which can extract

gapped subsequences supporting wild-cards, and (3) eval-

uating REBMEC’s performance on other domains which

have data in sequential form.

References

[1] N. Abe and M. K. Warmuth. On the computational complexity of approximating distributions by probabilistic automata. Machine Learning, 9:205–260, 1992.

[2] S. F. Altschul et al. Gapped BLAST and PSI-BLAST:

a new generation of protein database search pro-

grams. Nucleic Acids Research, 25:3389–3402, 1997.

[3] C. Andorf, A. Silvescu, D. Dobbs, and V. Honavar.

Learning classifiers for assigning protein sequences

to gene ontology functional families. In Proc. of the

Fifth Int. Conf. on KBCS, pages 256–265, 2004.

[4] A. Bateman et al. The Pfam protein families database. Nucleic Acids Research, 36(Database-Issue):281–288, 2008.

[5] G. Bejerano and G. Yona. Modeling protein families

using probabilistic suffix trees. In Proc. of RECOMB,

pages 15–24, 1999.

[6] A. Ben-Hur and D. Brutlag. Remote homology de-

tection: a motif based approach. Bioinformatics, 19

Suppl 1:26–33, 2003.

[7] A. L. Berger, S. D. Pietra, and V. J. D. Pietra. A max-

imum entropy approach to natural language process-

ing. Computational Linguistics, 22(1):39–71, 1996.

Page 12

[8] B. Boeckmann et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research, 31(1):365–370, 2003.

[9] E. C. Buehler and L. H. Ungar. Maximum entropy

methods for biological sequence modeling. In Proc.

of BIOKDD, pages 60–64, 2001.

[10] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43:1470–1480, 1972.

[11] S. R. Eddy. HMMER: Profile hidden Markov mod-

elling. Bioinformatics, 14(9):755–763, 1998.

[12] E. Eskin, W. S. Noble, and Y. Singer. Protein family

classification using sparse markov transducers. Jour-

nal of Computational Biology, 10(2):187–214, 2003.

[13] P. G. Ferreira and P. J. Azevedo. Protein sequence

classification through relevant sequence mining and

bayes classifiers. In Proc. of EPIA, pages 236–247,

2005.

[14] P. G. Ferreira and P. J. Azevedo. Query driven sequence pattern mining. In Proc. of SBBD, pages 1–15, 2006.

[15] F. Horn et al. GPCRDB information system for G

protein-coupled receptors. Nucleic Acids Research,

31(1):294–297, 2003.

[16] D. Kang, A. Silvescu, and V. Honavar. RNBL-MN:

A recursive naive bayes learner for sequence classifi-

cation. In Proc. of PAKDD, pages 45–54, 2006.

[17] D. Kang, J. Zhang, A. Silvescu, and V. Honavar. Multinomial event model based abstraction for sequence and text classification. In Proc. of SARA, pages 134–148, 2005.

[18] S. B. Kotsiantis and P. E. Pintelas. Increasing the clas-

sification accuracy of simple bayesian classifier. In

Proc. of AIMSA, pages 198–207, 2004.

[19] A. Krogh et al. Hidden markov models in compu-

tational biology: Applications to protein modeling.

Journal of Molecular Biology, 235:1501–1531, 1994.

[20] N. Lesh, M. J. Zaki, and M. Ogihara. Mining features

for sequence classification. In Proc. of KDD, pages

342–346, 1999.

[21] C. Leslie, E. Eskin, and W. Noble. Mismatch string

kernels for SVM protein classification. In Proc. of

NIPS, pages 1417–1424, 2002.

[22] D. J. Lipman et al. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.

[23] R. Malouf. A comparison of algorithms for maximum

entropy parameter estimation. In Proc. of Sixth Conf.

on Natural Lang. Learning, pages 49–55, 2002.

[24] K. Marsolo and S. Parthasarathy. Protein classifica-

tion using summaries of profile-based frequency ma-

trices. In Proc. of BIOKDD06, pages 51–58, 2006.

[25] A. McCallum and K. Nigam. A Comparison of Event

Models for Naive Bayes Text Classification. In Proc.

of AAAI-98 Workshop on Learning for Text Catego-

rization, pages 41–48, 1998.

[26] I. Melvin et al. Multi-class protein classification us-

ing adaptive codes. Journal of Machine Learning Re-

search, 8:1557–1581, 2007.

[27] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In Proc. of IJCAI-99 Workshop on ML for Info. Filtering, pages 61–67, 1999.

[28] D. Pavlov. Sequence modeling with mixtures of con-

ditional maximum entropy distributions. In Proc. of

ICDM, pages 251–258, 2003.

[29] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proc. National Academy of Sciences USA, 85(8):2444–2448, 1988.

[30] S. D. Pietra, V. J. D. Pietra, and J. D. Lafferty. Induc-

ing features of random fields. IEEE Tran. on Pattern

Analysis and Machine Intel., 19(4):380–393, 1997.

[31] P. Rani and V. Pudi. RBNBC: Repeat Based Naive

Bayes Classifier for Biological Sequences. To appear

in Proc. of ICDM’08: IEEE Int. Conf. on Data Min-

ing, 2008.

[32] A. Ratnaparkhi. A simple introduction to maximum entropy models for natural language processing. Technical report, IRCS Report 97-98, Univ. of Pennsylvania, 1997.

[33] A. Ratnaparkhi. Maximum Entropy Models for Natu-

ral Language Ambiguity Resolution. PhD thesis, Uni-

versity of Pennsylvania, 1998.

[34] N. Tatti. Maximum entropy based significance of itemsets. In Proc. of ICDM, pages 312–321, 2007.

[35] R. Thonangi and V. Pudi. Acme: An associative clas-

sifier based on maximum entropy principle. In Proc.

of ALT, pages 122–134, 2005.

[36] J. T. L. Wang, M. J. Zaki, H. Toivonen, and D. Shasha, editors. Data Mining in Bioinformatics. Springer, 2005.
