
Analysis and Transformation of Textual Energy Distribution



In this paper we revisit the Textual Energy model. We deal with the two major disadvantages of the Textual Energy: the asymmetry of the distribution and the unboundedness of the maximum value. Although this model has been successfully used in several NLP tasks like summarization, clustering and sentence compression, no correction of these problems has been proposed until now. Concerning the maximum value, we analyze the computation of the Textual Energy matrix and we conclude that energy values are dominated by the lexical richness, growing quadratically with the vocabulary size. Using the Box-Cox transformation, we show empirical evidence that a log transformation could correct both problems.
Alejandro Molina, Juan-Manuel Torres-Moreno, Eric SanJuan, Gerardo Sierra and Julio Rojas-Mora
Université d’Avignon et des Pays de Vaucluse
Laboratoire Informatique d’Avignon
{alejandro.molina-villegas, juan-manuel.torres, eric.sanjuan}
Universidad Nacional Autónoma de México
Instituto de Ingeniería
{amolinav, gsierram}
Institute of Statistics
Universidad Austral de Chile
Keywords-weight model; Sentence Importance; Measures of
One of the major advances in Natural Language Process-
ing (NLP) is the possibility of processing textual content in
numerical ways. Since the Vector Space Model, proposed in
[1], it has been possible to represent texts by vectors. Each
element of the vector is assigned a value, called weight,
which represents the importance of a word in the text. The
most widely used technique, TF-IDF [2], assigns a value
according to the number of occurrences of a word in a text
(Term Frequency) and a numerical statistic which reflects
how important a word is to a document in a collection
of documents (Inverse Document Frequency). Other weight models have been proposed as well (see, for example, Latent Semantic Indexing [3] and LexRank [4]), all of them sharing the main advantage of numerical processing: language independence.
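The TF-IDF weighting described above can be sketched as follows. This is only an illustration: the whitespace tokenization, the toy documents, the function name `tf_idf` and the particular IDF variant, log(N/df), are assumptions, and many other variants are in use.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency inside this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy collection.
docs = ["the cat sat", "the dog sat", "the cat ran away"]
w = tf_idf(docs)
```

A term that occurs in every document (here "the") receives a weight of zero, while a term found in a single document (here "away") receives the largest weight for a given frequency.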
However, in some cases, we must face the problem of
low frequencies in the vocabulary. This is common when
vectors represent sentences, phrases or short texts like tweets
and blog opinions. Therefore, we must consider alternative
weight models, adapted to low frequency vectors.
The Textual Energy model [5], [6] was conceived from
the beginning to work with unitary frequencies. In its
original definition, phrases are represented as binary vectors.
Nevertheless, this model presents two serious disadvantages
studied in this paper.
In section II we present the Textual Energy model. Then,
in section III we explain how textual energy values have
to be computed from scratch. In section IV, we present an
analysis of the upper boundary of textual energy based on
computations of section III. Finally, in section V we use
the Box-Cox transformation to process one thousand text
fragments and we conclude that a logarithmic transforma-
tion could correct the textual energy distribution avoiding
external parameters.
The Textual Energy model was originally inspired by a
physics model. The main idea is that words in a document
could be coded as magnetic spins with two possible states:
“word present” (represented by the spin +1) and “word
absent” (represented by the spin -1). Following this analogy,
every word is influenced by the others, directly or indirectly,
since they all belong to the same system. Thus, as the spins
influence each other in a physical system, the words in the
document interact giving sense to other words and coherence
to the whole text.
We used as a basis the ideas of the magnetic model proposed by Ising [7]. In Ising’s model, the magnetic moment of a material is coded by discrete variables that represent spins; each variable has one of two possible values. It is supposed that the spins interact with each other but that they are not influenced by external fields (which is a simplification of the real interactions). Such a system might be represented by a complete graph in which nodes represent spins and arcs represent the interaction between spins. In Fig. 1, the complete graph with eight units ($K_8$) represents a system in which three units ($u_1$, $u_3$ and $u_6$) have a positive spin and five units ($u_2$, $u_4$, $u_5$, $u_7$ and $u_8$) have a negative one.
Hopfield adopted Ising’s ideas and applied them to Neural Networks theory [8]. In the Hopfield network, every unit has an activation state (in practice a binary variable) and units are fully interconnected as in the Ising model. The connection strength between two units $u_i$ and $u_j$ is described by a real number representing the weight value: $\mathrm{weight}(u_i, u_j)$. The network has the following restrictions:
There are no auto-connections: $\mathrm{weight}(u_i, u_i) = 0,\ \forall i$.
Connections are symmetric: $\mathrm{weight}(u_i, u_j) = \mathrm{weight}(u_j, u_i),\ \forall i, j$.
Figure 1. A complete graph with eight units, $K_8$, representing a Hopfield network.
2013 12th Mexican International Conference on Artificial Intelligence
978-1-4799-2604-6/13 $31.00 © 2013 IEEE
DOI 10.1109/MICAI.2013.32
One of the main applications of the Hopfield network is
the associative memory. The main idea is that the network
memorizes input patterns constantly repeated in the coded
input examples. Some of the input might contain noisy
information that the network would try to ignore. In the
training process, spins are adapted depending on the current
input. This adaptation of spins modifies weights according
to Hebb’s rule described in [9]. The result is a different
network state at each iteration and an associated value of
the state. At this stage, the network could change drastically
giving the impression of an electric discharge. That is the
reason why Hopfield called this value “the energy”. When
the system stops modifying weights it has converged to a
stable state of low energy. This allows the recognition of
learned patterns. Fig. 2 depicts the example of the network
in Fig. 1 converging to a minimum energy state.
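The convergence to a low-energy state can be illustrated with a minimal sketch. It assumes the textbook forms of Hebbian storage and of the Hopfield energy, $-\frac{1}{2}\sum_{i,j} w_{i,j} s_i s_j$, which are not spelled out in this paper; the function names and the noisy input are hypothetical. The stored pattern uses the spins of Fig. 1 ($u_1$, $u_3$ and $u_6$ positive).

```python
def hebb_weights(patterns):
    """Hebbian weights for +1/-1 patterns: symmetric, with a zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no auto-connections
                    w[i][j] += p[i] * p[j] / n
    return w

def energy(w, state):
    """Hopfield energy: -1/2 * sum over i, j of w[i][j] * s_i * s_j."""
    n = len(state)
    return -0.5 * sum(w[i][j] * state[i] * state[j]
                      for i in range(n) for j in range(n))

def recall(w, state, sweeps=5):
    """Asynchronous updates; returns the final state and the energy trace."""
    state = list(state)
    trace = [energy(w, state)]
    for _ in range(sweeps):
        for i in range(len(state)):
            field = sum(w[i][j] * state[j] for j in range(len(state)))
            state[i] = 1 if field >= 0 else -1
        trace.append(energy(w, state))
    return state, trace

stored = [1, -1, 1, -1, -1, 1, -1, -1]   # spins of Fig. 1
noisy = [1, -1, -1, -1, -1, 1, -1, -1]   # same pattern with u3 flipped
w = hebb_weights([stored])
final, trace = recall(w, noisy)
```

In this toy run the flipped spin is restored during the first sweep and the energy never increases from one sweep to the next, which is the behavior sketched in Fig. 2.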
One issue of the associative memory is that it can quickly become saturated: the number of patterns that can be memorized correctly is only about 14% of the number of units. This situation has limited its practical applications, as discussed in [9]. Nevertheless, several authors adapted this model to NLP and proposed the Textual Energy, showing good results in topic segmentation [6], summarization [10], clustering [11] and sentence compression [12].
Let $T$ be the number of unique terms of a document, that is, the size of the vocabulary. We can represent a phrase $\phi$ as a string of $T$ spins where the term $i$ could be present or absent. Thus, a document with $\Phi$ phrases is composed of $\Phi$ vectors in a space of dimensions $[\Phi \times T]$, represented by a textual matrix.
Figure 2. Example of energy changes in a Hopfield network.
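Let (1) be the document-term matrix of a document with $\Phi$ phrases and $T$ terms.
$$A_{[\Phi \times T]} = \begin{pmatrix} a_{1,1} & \cdots & a_{1,T} \\ \vdots & \ddots & \vdots \\ a_{\Phi,1} & \cdots & a_{\Phi,T} \end{pmatrix} \qquad (1)$$
In $A$, $a_{i,j} = tf(\phi_i, w_j)$ is the frequency of the term $w_j$ in the phrase $\phi_i$.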
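As an illustration of the document-term matrix, the sketch below builds $A$ from a list of phrases. The whitespace tokenization, the alphabetical ordering of the vocabulary, the function name and the three toy phrases are assumptions; the paper does not specify its preprocessing.

```python
def document_term_matrix(phrases):
    """Return (vocabulary, A) where A[i][j] = tf(phrase_i, word_j)."""
    tokenized = [p.lower().split() for p in phrases]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    index = {w: j for j, w in enumerate(vocabulary)}
    A = [[0] * len(vocabulary) for _ in tokenized]
    for i, tokens in enumerate(tokenized):
        for w in tokens:
            A[i][index[w]] += 1  # count every occurrence of the term
    return vocabulary, A

# Hypothetical toy document with three phrases.
phrases = ["energy of a text", "textual energy model", "a model of a text"]
vocab, A = document_term_matrix(phrases)
```

Each row of `A` is one phrase and each column one vocabulary term, so `A` has $\Phi = 3$ rows and $T = 6$ columns here. Most entries are 0 or 1, the low-frequency setting that the model targets.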
As proposed in [5], [10], the interaction weights between terms are computed using Hebb’s rule in its matricial form by the product (2),
$$J = A^t \times A \qquad (2)$$
where $j_{i,j} \in J$ describes the frequencies of terms occurring in the same phrase and terms co-occurring with previously related terms.
The Textual Energy matrix (3) is computed by two successive products. The energy values, $e_{i,j} \in E$, represent the degree of relationship between $\phi_i$ and $\phi_j$.
$$E = A \times J \times A^t = (AA^t)^2 \qquad (3)$$
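The two successive products can be sketched in a few lines. The sketch assumes that Hebb's rule takes the form $J = A^t A$ and that the energy matrix is $E = A J A^t$, which equals the $(AA^t)^2$ expression used in the upper-bound analysis; the helper names and the toy matrix are hypothetical.

```python
def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*M)]

def matmul(X, Y):
    """Plain-list matrix product X * Y."""
    Yt = transpose(Y)
    return [[sum(x * y for x, y in zip(row, col)) for col in Yt] for row in X]

def textual_energy(A):
    """Assumed forms: J = A^t A (Hebb's rule) and E = A J A^t = (A A^t)^2."""
    At = transpose(A)
    J = matmul(At, A)              # term-term interaction weights, T x T
    E = matmul(matmul(A, J), At)   # phrase-phrase energies, Phi x Phi
    return J, E

# Toy document-term matrix: 3 phrases, 6 terms.
A = [[1, 1, 0, 1, 1, 0],
     [0, 1, 1, 0, 0, 1],
     [2, 0, 1, 1, 1, 0]]
J, E = textual_energy(A)
```

`E` is symmetric, and its entry `E[i][j]` is large when phrases `i` and `j` share terms directly or through a third phrase.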
In [12], the following two major disadvantages of the Textual Energy were found when the authors tried to apply it to measure the informativeness of intra-phrase discourse segments.
Observation 1: Textual Energy values have no upper bound, $e_{i,j} \in [0, \infty)$, because the maximum energy value of a text depends entirely on local word frequencies (integer values). As a consequence, it is difficult to compare documents through their Textual Energy values.
Observation 2: The distribution of Textual Energy values is skewed and biased towards low values, perhaps following a power law distribution. As a consequence, differences among low energy values are not significant enough to distinguish important phrases from unimportant ones.
Figure 3. Density plot of Textual Energy distribution for 1,266 discourse segments.
Fig. 3 shows the density plot of the distribution for more than one thousand discourse segments from the corpus of sentence compression in Spanish¹. We observe highly asymmetric characteristics of Textual Energy values. In section IV we theoretically analyze Observation 1 and in section V we propose a way to address Observation 2.
In order to find the upper bound of Textual Energy, we first consider the easiest case, when the document-term matrix $A$ contains only binary values. Under these conditions, Textual Energy values are maximal if all the terms appear in all the phrases of the document: $a_{i,j} = 1,\ \forall a_{i,j} \in A$.
$$A_{[\Phi \times T]} = \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix} \qquad (4)$$
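According to section III, we must compute $(AA^t)$, where every element in the resulting matrix is computed as $\sum_{i=1}^{T} (1 \times 1)$:
$$(AA^t) = \begin{pmatrix} 1^2 + \cdots + 1^2 & \cdots & 1^2 + \cdots + 1^2 \\ \vdots & \ddots & \vdots \\ 1^2 + \cdots + 1^2 & \cdots & 1^2 + \cdots + 1^2 \end{pmatrix} = \begin{pmatrix} T & \cdots & T \\ \vdots & \ddots & \vdots \\ T & \cdots & T \end{pmatrix} \qquad (5)$$
¹ Corpus available at compression/data.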
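By (3), the Textual Energy values are those of the matrix $(AA^t)^2$, where every element is computed as $\sum_{i=1}^{\Phi} (T \times T)$, resulting in (6).
$$(AA^t)^2 = \begin{pmatrix} T^2 + \cdots + T^2 & \cdots & T^2 + \cdots + T^2 \\ \vdots & \ddots & \vdots \\ T^2 + \cdots + T^2 & \cdots & T^2 + \cdots + T^2 \end{pmatrix} = \begin{pmatrix} \Phi T^2 & \cdots & \Phi T^2 \\ \vdots & \ddots & \vdots \\ \Phi T^2 & \cdots & \Phi T^2 \end{pmatrix} \qquad (6)$$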
Consequently, Textual Energy values $e_{i,j}$ are $O(\Phi T^2)$ for the binary case. However, the number of phrases is always less than the number of terms ($\Phi < T$). So we can say that $e_{i,j}$ are $O(T^2)$.
In the general case where frequencies could be greater than one, let us consider a constant $c$ to be the maximum frequency value and suppose that all term frequencies are equal to this value: $tf(\phi_i, w_j) = c,\ \forall i \in [1, \Phi],\ \forall j \in [1, T]$. By analogy to (5), elements in $(AA^t)$ can be computed as $\sum_{i=1}^{T} (c \times c) = Tc^2$ and, according to (6), the elements of $(AA^t)^2$ take the form $\sum_{i=1}^{\Phi} (Tc^2 \times Tc^2) = \Phi T^2 c^4$.
It turns out that, in general, Textual Energy values $e_{i,j}$ are $O(\Phi T^2 c^4)$. However, we argue that it is always $T^2$ which dominates the growth. Although the maximum frequency $c$ has the highest exponent, in practical situations it is the size of the vocabulary that determines the final energy values. In real documents, the maximum term frequency in the same sentence is $1 \le c \le 3$. In fact, in our experiments with one thousand discourse segments, $1 \le c \le 2$ always holds. Taking this into account, we can affirm that Textual Energy values are $O(kT^2)$, where the factor $k$, which depends on $c^4$, could be considered a constant.
This result means that the lexical richness of a document determines the maximum energy value. For instance, given an artificial document with 1,000 word types (one page and a half) we could reach Textual Energy values of $10^6$, assuming a great lexical richness.
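The bound can be checked numerically on saturated documents, in which every term occurs $c$ times in every phrase. The sketch below assumes the energy matrix is $(AA^t)^2$, as stated in the analysis; the function names and the chosen sizes are illustrative.

```python
def energy_matrix(A):
    """E = (A A^t)^2 computed with plain lists (A is Phi x T)."""
    # G = A A^t: dot products between every pair of phrase vectors.
    G = [[sum(a * b for a, b in zip(r1, r2)) for r2 in A] for r1 in A]
    n = len(G)
    return [[sum(G[i][k] * G[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def saturated_energy(phi, T, c=1):
    """Energy of a document where every term occurs c times in every phrase."""
    A = [[c] * T for _ in range(phi)]
    return energy_matrix(A)

binary = saturated_energy(phi=4, T=10)         # every entry is 4 * 10**2
general = saturated_energy(phi=4, T=10, c=3)   # every entry is 4 * 10**2 * 3**4
one_page = saturated_energy(phi=1, T=1000)     # single entry is 1000**2
```

With $c$ fixed at a small value, doubling $T$ multiplies every entry by four, whereas adding a phrase only adds one more $T^2 c^4$ term to each sum.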
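The Box-Cox transformation is typically used for asymmetric distributions like that of Textual Energy. This transformation is a continuous function with a single parameter $\lambda$ which can be adjusted to correct the distribution [13].
In the case of the Textual Energy, $e_{i,j}$ values are transformed into $e^{\lambda}_{i,j}$ according to (7),
$$e^{\lambda}_{i,j} = \begin{cases} \dfrac{(e_{i,j})^{\lambda} - 1}{K_1} & \text{if } \lambda \neq 0 \\ K_2 \log(e_{i,j}) & \text{if } \lambda = 0 \end{cases} \qquad (7)$$
where $K_2$ corresponds to the geometric mean (8) and $K_1$ depends on $\lambda$ and $K_2$ according to (9).
$$K_2 = \Big( \prod_{i,j} e_{i,j} \Big)^{1/n} \qquad (8)$$
$$K_1 = \lambda K_2^{\lambda - 1} \qquad (9)$$
Here $n$ denotes the number of energy values.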
After iterating the parameter $\lambda$ from -1 to 1 with a step of 0.001, we found that the best parametrization, the nearest to a normal distribution, is obtained for $\lambda = 0.015$.
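A grid search of this kind can be sketched as follows. Two points are assumptions rather than the authors' procedure: "nearest to a normal distribution" is approximated by the maximum of the Box-Cox profile log-likelihood, and the data are a synthetic, right-skewed log-normal sample instead of real energy values. The function names are hypothetical, and the transform requires strictly positive values.

```python
import math
import random

def boxcox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def log_likelihood(values, lam):
    """Profile log-likelihood of lambda under a normal model (up to a constant)."""
    y = boxcox(values, lam)
    n = len(y)
    mean = sum(y) / n
    var = sum((t - mean) ** 2 for t in y) / n
    return (lam - 1.0) * sum(math.log(v) for v in values) - 0.5 * n * math.log(var)

def best_lambda(values, lo=-1.0, hi=1.0, steps=2000):
    """Grid search over [lo, hi]; steps=2000 gives a step of 0.001."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda lam: log_likelihood(values, lam))

random.seed(0)
# Synthetic stand-in for a skewed energy distribution.
sample = [random.lognormvariate(0.0, 1.0) for _ in range(300)]
lam_hat = best_lambda(sample)
```

For a log-normal sample the selected value should lie close to zero, which is the situation in which the transformation reduces to a logarithm.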
Since the best parametrization is obtained when $\lambda \approx 0$, we argue that textual energy can be corrected simply by applying the logarithm function. As suggested in [13], in this situation external parameters could be avoided for adjusting the distribution of the values. Therefore, we propose (10) to measure the energy (the informativeness) of a sentence $\phi_i$ taking into account the whole document context.
$$\mathrm{Info}(\phi_i, \Phi) = \frac{\log(E(\phi_i) + 1)}{\max_{\phi_k \in \Phi} \log(E(\phi_k) + 1)} \qquad (10)$$
This last score has a quasi-normal distribution and is bounded in [0,1].
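The score can be sketched in a few lines, taking the denominator of (10) to be the largest value of $\log(E(\phi) + 1)$ in the document. The sketch assumes one non-negative energy value per phrase; how that value is aggregated from the energy matrix is not specified here, and the energies below are hypothetical.

```python
import math

def info_scores(energies):
    """log(E + 1) divided by the maximum of log(E + 1) over the document."""
    logs = [math.log(e + 1.0) for e in energies]
    top = max(logs)
    if top == 0.0:  # all energies are zero: avoid dividing by zero
        return [0.0 for _ in logs]
    return [v / top for v in logs]

energies = [0, 11, 33, 45, 66, 1200]   # hypothetical per-phrase energies
scores = info_scores(energies)
# Plain division by the maximum, for contrast.
simple = [e / max(energies) for e in energies]
```

With the plain normalization a single very large energy pushes every other phrase towards zero, while the log score keeps low-energy phrases distinguishable from one another.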
The density plot after transformation using (10) for the same 1,266 discourse segments is shown in Fig. 4. We observe that, although there are still some problems for values between 0 and 0.4, the curve looks smoother and better distributed than that obtained using pure energy values in Fig. 3. The advantages of the proposed transformation are clearer when we observe its application to a real text.
Figure 4. Density plot of transformed Textual Energy distribution for 1,266 discourse segments.
Fig. 5 shows an example of the effect of transforming Textual Energy values using a simple normalization (only dividing by the maximum value) and using the score of (10). The first column corresponds to the simple normalization and the second one to the score. Gray tones were mapped from real values in the [0,1] interval. Dark gray means low values (near 0) and light gray is used for high values (near 1). The dashed line in the middle corresponds to the global mean value and the bars in the cells represent the deviation from this value. We observe that after the log transformation it would be possible to assign textual segments to two well-balanced classes, low energy and high energy, by comparing deviations. Segments near the mean value would be considered of normal importance and segments above the mean value would be considered the highly important ones. Besides, the sensitivity of the original Textual Energy has been increased for less important segments. Before transformation, none of the segments of the second sentence, “Their survival depends on their ability to regulate the expression of genes coding for the enzymes and transport proteins, required for growth in the altered environment”, is considered important. In fact, only its second segment has a non-zero value.
We have improved the Textual Energy model, which has been used for scoring terms and phrases in Summarization, Topic Segmentation, Text Classification, Sentence Compression and other tasks. We have analyzed the reasons for its two major disadvantages: asymmetry of the distribution and unboundedness of the maximum value. We have proved that the maximum value of Textual Energy is determined by the lexical richness of a document. In future experiments, we will analyze this phenomenon empirically, using different sizes of document corpora (tweets, Wikipedia documents and scientific articles) in order to verify our hypothesis on short, medium and long documents.
After applying the Box-Cox transformation to over one thousand discourse segments, we have found that the optimal value of the parameter $\lambda$ is very close to zero. According to the Box-Cox theoretical framework, we suggest correcting the Textual Energy distribution using a log transformation score. We conclude that Textual Energy must be corrected before its use for comparison purposes and that the easiest way of doing it, without external parametrization, is using a log transformation.
We would like to thank Peter Peinl and Patricia Velazquez
for all their help and support. This work was partially sup-
ported by the CONACyT grant 211963 and project 178248.
[1] G. Salton, The SMART Retrieval System – Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs, 1971.
[2] K. Spärck-Jones, “A statistical interpretation of term specificity and its application in retrieval,” Journal of Documentation, vol. 28, no. 1, pp. 11–21, 1972.
[3] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by Latent Semantic Analysis,” Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.
[4] G. Erkan and D. R. Radev, “LexRank: Graph-based Lexical Centrality as Salience in Text Summarization,” Journal of Artificial Intelligence Research, vol. 22, no. 1, pp. 457–479, 2004.
[5] S. Fernández, E. SanJuan, and J.-M. Torres-Moreno, “Energie textuelle des mémoires associatives,” in Proceedings of Traitement Automatique de la Langue Naturelle (TALN’07), vol. 1, Toulouse, France, 5-8 Juin 2007, pp. 25–34.
[6] S. Fernández, E. SanJuan, and J.-M. Torres-Moreno, “Textual energy of associative memories: performants applications of enertex algorithm in text summarization and topic segmentation,” in Mexican International Conference on Artificial Intelligence (MICAI’07), Aguascalientes, Mexico, 2007, pp.
[7] D. J. Amit, H. Gutfreund, and H. Sompolinsky, “Statistical mechanics of neural networks near saturation,” Annals of Physics, vol. 173, no. 1, pp. 30–67, 1987.
[8] J. J. Hopfield and D. W. Tank, “‘Neural’ computation of decisions in optimization problems,” Biological Cybernetics, vol. 52, no. 3, pp. 141–152, 1985.
[9] J. A. Hertz, A. S. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation. Westview Press, 1991, vol. 1.
[10] S. Fernández, E. SanJuan, and J.-M. Torres-Moreno, “Enertex : un système basé sur l’énergie textuelle,” in Proceedings of Traitement Automatique des Langues Naturelles (TALN’08), Avignon, France, 2008, pp. 99–108.
[11] A. Molina, G. Sierra, and J.-M. Torres-Moreno, “La energía textual como medida de distancia en agrupamiento de definiciones,” in Journées d’Analyse Statistique de Documents (JADT’10), Rome, 2010.
[12] A. Molina, J.-M. Torres-Moreno, E. SanJuan, I. da Cunha, and G. S. Martínez, “Discursive sentence compression,” in Computational Linguistics and Intelligent Text Processing. Springer, 2013, pp. 394–407.
[13] G. Box and D. Cox, “An analysis of transformations,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 211–252, 1964.
Figure 5. Example of two transformations of textual energy values for discourse segments in a scientific abstract.