AN IMPROVED SCORING SCHEME FOR PREDICTING GLYCAN
STRUCTURES FROM GENE EXPRESSION DATA
AKITSUGU SUGA YOSHIHIRO YAMANISHI KOSUKE HASHIMOTO
email@example.com firstname.lastname@example.org email@example.com
SUSUMU GOTO MINORU KANEHISA
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho,
Uji, Kyoto 611-0011, Japan
The prediction of glycan structures from the gene expression of glycosyltransferases (GTs) is a
challenging new area in computational biology, because the biosynthesis of glycan chains is under
the control of GT expression. In this paper, we develop a new method for predicting glycan
structures from gene expression data. The proposed method has two main original aspects.
First, we increase the number of predictable glycan structure candidates by estimating
missing glycans from a global glycan structure map, which enables us to predict new glycan
structures that are not stored in the database. Second, we propose a more general scoring scheme
that uses real-valued gene expression intensity directly rather than converting it into binary
information. We apply the proposed method to predicting cancer-specific glycan structures from gene
expression profiles of patients with acute lymphocytic leukemia (ALL) and acute myelocytic leukemia
(AML). We confirm that several of the predicted glycan structures correspond to
known cancer-specific glycan structures reported in the literature, and that our method outperforms the
previous methods at a statistically significant level.
Keywords: glycosyltransferase; glycan structure; DNA microarray; gene expression.
Glycans are carbohydrate chains attached to lipids or proteins and are notable as the third
type of biological chain after DNA and proteins, since they have a huge variety of
structures and play key roles in a wide variety of biological processes, such as immunity
and disease pathogenesis. Pathogens have evolved to exploit host lineage-specific
glycans and are constantly shaping the glycomes of their hosts. It is well known that
some N-linked glycans are necessary for proper protein folding in eukaryotes and that specific
glycan structures are expressed in carcinoma samples. In addition, some glycans are
involved in cell adhesion. Understanding glycan functions requires determining
glycan structures in addition to genome and amino acid sequences.
Powerful experimental instruments for glycan purification and analysis have
been developed and successively improved, such as high-performance liquid
chromatography, capillary electrophoresis, mass spectrometry and nuclear magnetic
resonance technology. In addition, a variety of computational tools have recently
been developed, such as automatic annotation tools for mass spectrometry, glycan
structure matching methods, glycan composite structure maps and glycan
structure prediction methods. However, even with these advances, the experimental
determination and computational analysis of glycan structures are still difficult. This is
because glycans have more complicated structures than DNA and proteins. While
nucleotide and amino acid chains are linear and consist of 4 and 20 elementary
components, respectively, glycan chains are branched structures and consist of a number
of monosaccharides. In addition, monosaccharides are multivalent, and their linkages have
anomeric configurations (alpha or beta).
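The structural difference described above can be made concrete with a small data-structure sketch. The Python snippet below represents a glycan as a tree in which each node is a monosaccharide and each edge to the parent carries an anomeric configuration and a linkage position. The class name `Residue`, its fields, the helper functions and the choice of the common N-glycan core as an example are illustrative assumptions and are not part of the method described in this paper.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Residue:
    """One monosaccharide in a glycan tree (hypothetical representation)."""
    name: str                      # e.g. "Man", "GlcNAc"
    anomer: Optional[str] = None   # "a" or "b" for the bond to the parent; None at the root
    linkage: Optional[str] = None  # carbon positions of the bond to the parent, e.g. "1-4"
    children: List["Residue"] = field(default_factory=list)

    def add(self, child: "Residue") -> "Residue":
        """Attach a child residue and return it so that chains can be built step by step."""
        self.children.append(child)
        return child


def count_residues(node: Residue) -> int:
    """Total number of monosaccharides in the tree rooted at node."""
    return 1 + sum(count_residues(c) for c in node.children)


def count_branch_points(node: Residue) -> int:
    """Number of residues carrying more than one child; always 0 for a linear chain."""
    here = 1 if len(node.children) > 1 else 0
    return here + sum(count_branch_points(c) for c in node.children)


def linkages(node: Residue) -> Iterator[Tuple[str, str, str]]:
    """Yield (child name, anomer + positions, parent name) for every glycosidic bond."""
    for child in node.children:
        yield (child.name, f"{child.anomer}{child.linkage}", node.name)
        yield from linkages(child)


# N-glycan core, written from the reducing end (root) outwards:
# Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc
core = Residue("GlcNAc")
second_glcnac = core.add(Residue("GlcNAc", "b", "1-4"))
central_man = second_glcnac.add(Residue("Man", "b", "1-4"))
central_man.add(Residue("Man", "a", "1-3"))
central_man.add(Residue("Man", "a", "1-6"))

for bond in linkages(core):
    print(bond)
```

A DNA or protein sequence needs only a string, whereas even this five-residue glycan needs a tree with labeled edges, which is one reason why sequence-oriented tools do not carry over directly.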
Recently, Kawano et al. developed a method for predicting glycan structures based
on microarray gene expression data. The basic idea of their method stems from the
fact that glycan biosynthesis is under the control of the expression of glycosyltransferases
(GTs). If the expression levels of GTs are known in the transcriptome or in the proteome
of a given organism, it should be possible to predict the repertoire of glycan structures
associated with the experimental conditions of the expression data, such as tissues, organs, and
diseases. Their method therefore uses the gene expression information of GTs in the
prediction process. However, Kawano’s method has some limitations. First, the
number of predictable glycans depends on the number of glycans stored in the database,
because the prediction is based on a database search. Second, the prediction accuracy
is not yet sufficient for practical use, because the method can handle only binary
information derived from microarray gene expression data.
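The second limitation can be illustrated with a toy example. The sketch below compares a binary, detection-style score with a score that uses the expression signal directly. The gene names, signal values, threshold and both scoring functions (a count of detected GTs and a mean of log2 signals) are assumptions chosen purely for illustration; they are not the scoring functions of Kawano et al. or of the method proposed here.

```python
import math

# Hypothetical signal intensities of the GTs required to build one candidate glycan,
# measured in two samples. All names and values are invented for illustration.
sample_a = {"GT_A": 120.0, "GT_B": 150.0, "GT_C": 110.0}
sample_b = {"GT_A": 2400.0, "GT_B": 3100.0, "GT_C": 1800.0}

DETECTION_THRESHOLD = 100.0  # assumed cut-off for calling a gene "present"


def binary_score(signals, threshold=DETECTION_THRESHOLD):
    """Count GTs called present; how far a signal lies above the threshold is discarded."""
    return sum(1 for s in signals.values() if s >= threshold)


def signal_score(signals):
    """Mean log2 signal of the required GTs; keeps the magnitude of expression."""
    return sum(math.log2(s) for s in signals.values()) / len(signals)


for label, sample in (("A", sample_a), ("B", sample_b)):
    print(label, binary_score(sample), round(signal_score(sample), 2))
```

Both samples receive the same binary score because every GT passes the threshold, although the GTs are expressed far more strongly in sample B. A score based on the real-valued signal separates the two samples.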
In this study, we propose a new method to predict glycan structures from gene
expression profiles by improving on the framework of Kawano’s method. First, we
introduce a strategy for predicting missing glycans, which are not stored in the glycan
database, in order to add new glycan structures to our candidate set using the glycan
composite structure map. Next, we propose a new scoring scheme that uses the original
real-valued expression values, the so-called ‘signal’, from the microarray data, rather
than binary values, the so-called ‘detection’, because most microarray data record gene
expression levels as real-valued signals. Finally, we apply
the proposed method to an experimental gene expression dataset from acute lymphocytic
leukemia (ALL) and acute myelocytic leukemia (AML) in order to predict cancer-specific
glycan structures. As a result, we find that the proposed method outperforms
Kawano’s method in terms of the number of correctly predicted cancer-specific glycan
structures.
Materials and Methods
To construct a GT reaction pattern library, GT genes were obtained from the human
genome in the KEGG GENES database based on their annotations. The reaction
We would like to express our gratitude to Dr. Nelson Hayes for helpful comments and
overall improvement of our manuscript. This work was supported by grants from the
Ministry of Education, Culture, Sports, Science and Technology of Japan and the Japan
Science and Technology Agency, as well as a bridging grant from the NIH/NIGMS
Consortium for Functional Glycomics and a research fellowship for young scientists
from the Japan Society for the Promotion of Science. The computational resources were
provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto
University and the Human Genome Center, Institute of Medical Science, The University
 Akama, T. O., Nakagawa, H., Sugihara, K., Narisawa, S., Ohyama, C., Nishimura, S.,
O'Brien, D. A., Moremen, K. W., Millan, J. L., and Fukuda, M. N., Germ cell survival
through carbohydrate-mediated interaction with Sertoli cells, Science, 295(5552):124-
 Aoki, K. F., Yamaguchi, A., Ueda, N., Akutsu, T., Mamitsuka, H., Goto, S., and
Kanehisa, M., KCaM (KEGG Carbohydrate Matcher): a software tool for analyzing
the structures of carbohydrate sugar chains, Nucleic Acids Res., 32(Web Server
 Bishop, J. R. and Gagneux, P., Evolution of carbohydrate antigens--microbial forces
shaping host glycomes?, Glycobiology, 17(5):23R-34R, 2007.
 Goldberg, D., Sutton-Smith, M., Paulson, J., and Dell, A., Automatic annotation of
matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics, 5(4):865-
 Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P.,
Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and
Lander, E. S., Molecular classification of cancer: class discovery and class prediction
by gene expression monitoring, Science, 286(5439):531-537, 1999.
 Hashimoto, K., Goto, S., Kawano, S., Aoki-Kinoshita, K. F., Ueda, N., Hamajima, M.,
Kawasaki, T., and Kanehisa, M., KEGG as a glycome informatics resource,
Glycobiology, 16(5):63R-70R, 2006.
 Kawano, S., Hashimoto, K., Miyama, T., Goto, S., and Kanehisa, M., Prediction of
glycan structures from gene expression data based on glycosyltransferase reactions,
Bioinformatics, 21(21):3976-3982, 2005.
 Kim, Y. J. and Varki, A., Perspectives on the significance of altered glycosylation of
glycoproteins in cancer, Glycoconj J., 14(5):569-576, 1997.
 von der Lieth, C. W., Bohne-Lang, A., Lohmann, K. K., and Frank, M.,
Bioinformatics for glycomics: status, methods, requirements and perspectives, Brief.
Bioinform., 5(2):164-178, 2004.