Textual Data Represented using Group Theory and NLP Problem Solving - An Abstract Mathematical Model

Abstract - This article lays emphasis on how the Abstract Algebra topic of Group Theory can be applied to problems of Natural Language Processing. Here a monoid is defined for certain NLP tasks, and the proposed extension of the free semigroup to a free Group is illustrated through its meaning in several NLP tasks. The article concludes by laying out some relevant results which can be carried from theoretical concepts to applied NLP applications. The article places an illustrative emphasis on the application of Group Theory in NLP.
Nidhika Yadav [a]

[a] Dr. Nidhika Yadav received her PhD degree in Nov 2021. She can be contacted on email id:
1. Introduction
Consider a problem at hand involving Textual Data, and let U be the universe of text units in the domain of the problem. The text units can be phrases, sentences or documents themselves. Hence U can be a collection of phrases, or individual texts, or a collection of documents from some server on the internet. Let W be a non-empty subset of U, and let W* = {w* | w ∈ W}, where * : W → W* is a one-one, onto mapping that assigns to each text unit w a formal counterpart w* (for example, its negation). Hence cardinality(W) = cardinality(W*).
The textual data can be one of the following:
1. A collection of phrases; the individual units are phrases here. One possible application is text summarization to create abstracts of texts.
2. A collection of sentences; the individual units are sentences here. The possible application here is text extraction, wherein full independent sentences are selected.
3. A collection of documents; the individual units are documents here. One possible application is information retrieval over the document collection.
In any of the above three cases, the length of a text shall be measured as the number of individual units in the text. For example, in the case of text abstraction, the Universe is a collection of phrases and the length of a text is the number of phrases in the Universe under consideration.
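As an illustrative sketch, length in units might be computed as below; the naive punctuation-based splitting is purely an assumption for demonstration, not part of the formal model:

```python
import re

def text_units(text, unit="sentence"):
    """Split text into units. The splitting heuristics are illustrative
    assumptions; the formal model treats units as given."""
    if unit == "sentence":
        units = re.split(r"(?<=[.!?])\s+", text.strip())
    elif unit == "phrase":
        units = [p.strip() for p in re.split(r"[,;.!?]", text) if p.strip()]
    else:  # "document": the whole text is a single unit
        units = [text]
    return [u for u in units if u]

doc = "Not good food. Very good food, served late."
print(len(text_units(doc, "sentence")))  # length measured in sentences: 2
print(len(text_units(doc, "phrase")))    # length measured in phrases: 3
```

The length of a text is then simply `len(text_units(...))` for the chosen unit.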
Define the formal text data under consideration as T, as follows.

Let T = W ∪ W*, and
T1 = T = {t | t ∈ T}
T2 = {t1t2 | t1, t2 ∈ T}
The operation here is the formal product, also known as juxtaposition.
Tn = {t1t2...tn | t1, t2, ..., tn ∈ T}; these can be called texts of length n. These can be n phrases or n sentences or even n documents, as the case of application may be.
T0 = {ε} = the set containing only the empty text unit.
Let the document collection under consideration be D; then D is defined as follows:
D = T0 ∪ T1 ∪ T2 ∪ ... = ∪n≥0 Tn
This represents text units of all possible finite lengths.
It can be proved that D is a free monoid with identity, generated by T, with the formal product as composition. To understand W*, consider the phrases "not good food" and "very good food". These are two phrases, and when juxtaposed with each other they will yield the empty text. This is akin to the cancellation law in Group Theory and can be read as cancelling the effect of text fragments, here phrases. In a way, this is like removing extra content from the text data. Now, D is a free semigroup with identity, but still not a Group.
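To make juxtaposition and cancellation concrete, here is a minimal sketch, assuming each text unit is an atomic token and modelling w* as a "starred" copy of w (these representation choices are illustrative, not part of the article's formal model):

```python
# A text is a tuple of units; a unit is (name, starred), where starred marks w*.
def star(unit):
    name, starred = unit
    return (name, not starred)

def juxtapose(t1, t2):
    """Formal product (juxtaposition) in the free monoid D = union of T^n."""
    return t1 + t2

def reduce_text(t):
    """Cancel adjacent pairs w w* (or w* w), mimicking the cancellation law."""
    out = []
    for u in t:
        if out and out[-1] == star(u):
            out.pop()          # w followed by w* cancels to the empty text
        else:
            out.append(u)
    return tuple(out)

w  = ("good food", False)      # a phrase
ws = ("good food", True)       # its starred (cancelling) counterpart

t = juxtapose((w,), (ws,))
print(reduce_text(t))          # () : the empty text unit
```

The empty tuple plays the role of the identity T0 = {ε}.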
How is making a group out of this data beneficial? The answer is that if we are able to make it a group, then we have a wide range of operations, results and applications available in Abstract Algebra that can be used to solve problems; if not immediately in practice, then at least from a theoretical point of view it is an abstract application of Group Theory in NLP. This is the motivation of this article. The next section briefly describes how the textual data can be represented as a Group. Section 3 illustrates some application areas in NLP for this abstract model. Section 4 describes some results from Group Theory which can be beneficial to NLP tasks, represented in terms of abstract operations of Group Theory.
2. Proposed Text data represented as an Abstract Group
Let S denote the set of text fragments under consideration (S = T in the construction above). Define a relation ≈ on D: w ≈ w′ if w′ can be obtained from w by a finite number of insertions and deletions of cancelling pairs ss* or s*s, with s a word from S.
Logically this can be viewed as selecting text fragments which are equivalent in representation. This can mean we may include certain phrases, with negations in between, and still get the same text fragment. Consider this as a text abstraction problem, in which an abstract means the same thing as the complete text. Hence, we have the following:
Abstract ≈ full_text
The same illustration holds for an extractive summary, wherein
Extract ≈ full_text
In the case of information retrieval, the set S is the collection of all documents. If the output generated by the information retrieval model is, say, O, then, ideally,
O ≈ Q, the query fed into the IR engine.
It can be easily verified that ≈ is an equivalence relation: it is well-defined, reflexive, symmetric and transitive. The equivalence class for the relation ≈ is defined as:
[w]≈ = the set of all elements of D which can be obtained from w by a finite number of insertions and deletions of cancelling pairs of elements of S. Hence the equivalence class, hypothetically, defines all the equivalent formulations of w. Let us write [w]≈ as [w] for simplicity hereafter in this article. [w1] = [w2] ⟺ w1 ≈ w2. That this is well-defined follows by noting that w1 and w2 are representations of text, while the class defines the meaning of w1 and w2.
Then, it can be worked out that
G = D/≈ = {[t] | t ∈ D} is a group with the following composition:
[t1][t2] = [t1t2]
We note that this composition is well-defined: if t1′ can be obtained from t1 by a finite number of insertions or removals of cancelling text fragments, then the same holds for t1′t2 with respect to t1t2. This can be used to prove the well-definedness of the composition. Associativity is trivial to prove for text fragments. In the case of the Information Retrieval problem, the identity corresponds to the presence of a counter-argument or opposite document. The class of the empty text is the identity, formed by pairs such as [ss*], where for example s is the phrase "going to school" while s* is the phrase "not going to school".
The last thing remaining to prove that G is a Group is to find the inverse of each [t1] in G. First, let t1 be a single text fragment. Then
[t1t1*] = [ε]
[t1]⁻¹ = [t1*]
For a composite member t of the text data, the inverse can be computed using the following.
Let [s] = [t1t2...tn], where each ti is a text fragment from S. Then
[s]⁻¹ = [t1t2...tn]⁻¹ = [tn*...t2*t1*]
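The inverse formula can be sketched as follows, under the same illustrative assumption as before that a text is a tuple of units and u* is a "starred" copy of u:

```python
def star(unit):
    name, starred = unit
    return (name, not starred)

def inverse(t):
    """Inverse in G: [t1 t2 ... tn]^(-1) = [tn* ... t2* t1*]."""
    return tuple(star(u) for u in reversed(t))

def reduce_text(t):
    """Cancel adjacent u u* pairs; the reduced form represents the class [t]."""
    out = []
    for u in t:
        if out and out[-1] == star(u):
            out.pop()
        else:
            out.append(u)
    return tuple(out)

t = (("going to school", False), ("good food", False))
print(reduce_text(t + inverse(t)))  # () : t followed by its inverse reduces to the empty text
```

Since t juxtaposed with inverse(t) reduces to the empty text, inverse(t) indeed represents [t]⁻¹.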
This makes G a group under the given relation ≈. Note that the relation ≈ is itself an interesting relation, with applications and real-world significance as described before. The next section describes some application areas of the Group so defined with the above composition.
3. Application areas in NLP
The above composition of the group is quite intuitive in NLP applications. Some of the applications are:
a. Text Abstraction, vis-à-vis creating a reframed text fragment using key phrases as they are. This NLP task creates a smaller text, typically rephrased from the original. The Group to be defined here is generated by the phrases; the application of relevant theorems in a generative mode can theoretically claim to provide the phrases which are important. Sentences can then be generated from the phrases using a Language Model.
b. Text Summarization via extraction of the important sentences.
c. Information Retrieval via selection of important documents given a query text.
d. Word Sense Disambiguation, in which the context of a sense is matched with the occurrence in which the word appears.
e. Machine Translation, in which the meanings are defined, not just the representations; here the definition of the equivalence class changes to having the same representation in a Unified Global Language.
All of these are popular NLP tasks; I leave it to the reader to formulate each problem in terms of the Group and the meaning of the equivalence relation in each case. For example, for the Information Retrieval problem, we can define a group action on G as follows:
φ : G × Q → Q
φ([t], q) can be defined as the common text fragments between the two inputs to φ. Here G acts on Q, the query set, and the result is an action of G on the Query set Q. This can be proved to be a group action satisfying φ([ε], q) = q and φ([t1][t2], q) = φ([t1], φ([t2], q)).
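The two group-action axioms can be checked mechanically on toy data. The sketch below does not implement the article's φ; it substitutes a deliberately simplified model (classes as finite sets of fragments composed by symmetric difference, so each fragment is its own inverse) purely to illustrate how a candidate action can be tested against the axioms:

```python
from itertools import product

# Toy model (an assumption, simpler than the article's G): a class is a
# frozenset of fragments, composed by symmetric difference, so each fragment
# is its own inverse and the empty set is the identity.
def compose(g1, g2):
    return g1 ^ g2

def act(g, q):
    """Candidate action of G on queries: toggle the fragments of g in q."""
    return q ^ g

identity = frozenset()
G = [frozenset(s) for s in ([], ["a"], ["b"], ["a", "b"])]
Q = [frozenset(s) for s in ([], ["a"], ["b"], ["a", "b"])]

# Check the two group-action axioms stated above:
assert all(act(identity, q) == q for q in Q)
assert all(act(compose(g1, g2), q) == act(g1, act(g2, q))
           for g1, g2, q in product(G, G, Q))
print("both group-action axioms hold on this toy model")
```

The same harness could be pointed at any other candidate definition of φ to see whether the axioms survive.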
The main concern of the article is to lay emphasis on the fact that textual NLP tasks can be modelled using Abstract Group Theory, and that the benefit we get out of it is the plethora of Group Theory results that comes with this. The next section points out some of the key results which can be applied to the above-described NLP tasks.
4. Some Results from Group Theory
In this section we describe some results from Group Theory that can be applied to NLP problems represented as an Abstract Group. The results are taken from the references, and important ones are given as follows with some significance to the NLP task at hand.
Theorem 4.1. Every finitely generated abelian group is a finite direct product of cyclic subgroups.
Now, to understand the significance of this theorem in NLP: if we can prove that, as per the meanings of text fragments, the output is independent of the order in which operations are performed, then the Group defined above is abelian w.r.t. the relation ≈ defined. This is logical as well, since the ranking and arrangement of the output can be done as a post-processing step. If we consider equality as equality of meaning of text fragments, then the Group is an abelian group. And by the very task in which it is applied, it is finitely generated, by hypothesis. The theorem hence says that it is a direct product of cyclic subgroups. This is a powerful statement, meaning we can find the generators of the key concepts covered in the text, or even in a server document collection. The very idea of how to find these cyclic subgroups can be found in the proof of the original theorem. Whether two equivalence classes have the same meaning can be determined via embedding-based models.
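As a rough sketch of such an embedding-based check, with hand-set toy vectors standing in for a real embedding model (the phrases, vectors and threshold below are all illustrative assumptions):

```python
import math

# Toy stand-ins for embedding vectors; a real system would use a trained
# embedding model, which is outside this article's formal scope.
EMB = {
    "a tasty meal":   [0.9, 0.1, 0.0],
    "delicious food": [0.85, 0.15, 0.05],
    "heavy rain":     [0.0, 0.2, 0.95],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def same_class(w1, w2, threshold=0.9):
    """Heuristic test that two fragments have the same meaning-class."""
    return cosine(EMB[w1], EMB[w2]) >= threshold

print(same_class("a tasty meal", "delicious food"))  # True
print(same_class("a tasty meal", "heavy rain"))      # False
```

Note that a fixed similarity threshold is not transitive, so this only approximates a true equivalence relation on meanings.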
Theorem 4.2. If G is finitely generated by t1, ..., tn, and if t is a formal product of t1, ..., tn, each repeated ai times, where the coefficients ai have G.C.D. 1, then G is finitely generated by t, t2, ..., tn.
This theorem states that G can be finitely generated by an equivalent textual representation as well.
Theorem 4.3. Every finite Nilpotent Group is a direct product of its distinct Sylow subgroups.
This theorem is just as important as Theorem 4.2 and does not require the Group to be abelian; hence we can use the definitions as proposed in this article and do not require equivalence classes to coincide when meanings are the same. Note that the converse of this theorem also holds true: once we prove the direct product decomposition, the group is Nilpotent.
Theorem 4.4. If G is a finite group acting on a finite non-empty set X, then the size of the orbit of x is the index of the stabilizer subgroup of x.
The significance of this can be seen in the Information Retrieval problem, as defined in Section 3. The theorem saying that a group is a direct product of cyclic subgroups has many applications, for texts stored in a server document collection and even for small text files, for text highlighting, extraction or abstraction, and even for Word Sense Disambiguation. Problems which treat two equivalence classes defined in Section 2 as equivalent based on meanings are called semantically equivalent, and those that use the representations as defined in Section 2 are called syntactic problems, as these take into account only the juxtaposition of text fragments.
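Theorem 4.4 can be verified directly on a small example, here the two-element group {identity, transposition (0 1)} acting on X = {0, 1, 2} (a toy choice, unrelated to any specific NLP dataset):

```python
# Each group element g is a tuple mapping x to g[x].
G = [(0, 1, 2), (1, 0, 2)]   # identity and the transposition (0 1)
X = [0, 1, 2]

def orbit(x):
    return {g[x] for g in G}

def stabilizer(x):
    return [g for g in G if g[x] == x]

for x in X:
    index = len(G) // len(stabilizer(x))
    print(x, len(orbit(x)), index)
    # Orbit-Stabilizer: the orbit size equals the index of the stabilizer.
    assert len(orbit(x)) == index
```

Here the orbit of 0 is {0, 1} with trivial stabilizer (index 2), while 2 is fixed by the whole group (orbit size 1, index 1).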
5. Conclusion and Future Work
The aim of the article was to provide an abstract-algebra-based treatment of NLP problems: the way in which NLP problems can be represented via Group Theory, and hence how elaborate Group Theory theorems, lemmas and results can be used in the area of NLP. This is an illustrative proposal; real-time implementations need to be worked out in these areas, and these can be modelled using experimental models and automated theorem provers. The future work is to provide more elaborate textual representations and the significance of the abstract algebra proofs in solving NLP problems both semantically and syntactically.
References
1. I. N. Herstein, Topics in Algebra.
2. S. Lang, Algebra I, II.
3. M. Artin, Algebra.
4. P. M. Cohn, Algebra.