
a Dr. Nidhika Yadav received her PhD degree in Nov 2021. She can be contacted at: nidhikayadav678@outlook.com.

Textual Data Represented using Group Theory and NLP Problem Solving – An Abstract Mathematical Model

Nidhika Yadava

nidhikayadav678@outlook.com

Abstract- This article lays emphasis on how the Abstract Algebra topic of Group Theory can be applied to problems of Natural Language Processing. Here a monoid is defined for certain NLP tasks, and the proposed extension of the free semigroup to a free Group is illustrated for its meaning in several NLP tasks. The article concludes by laying out some relevant results which can be carried from theoretical concepts to applied NLP applications. The article is an illustrative emphasis of the application of Group Theory in NLP.

1. Introduction

Consider a problem at hand of Textual Data, and let U be the universe of text units in the domain of the problem. The text units can be phrases, sentences or documents themselves. Hence U can be a collection of phrases, or individual texts, or a collection of documents from some server on the internet. Let W be a non-empty subset of U and let W* be defined via the map

W → W*
w ↦ w*

which is a one-one, onto mapping. Here cardinality(W) = cardinality(W*).

The textual data can be one of the following:

1. A collection of phrases; the individual units are phrases here. One possible application is text summarization to create abstracts of texts.

2. A collection of sentences; the individual units are sentences here. The possible application here is text extraction, wherein full independent sentences are selected.

3. A collection of documents; the individual units are documents here. One possible application is information retrieval from the collection of documents.

In any of the above three cases the length of a text shall be measured as the number of individual units in the text. For example, in the case of text abstraction, the Universe is a collection of phrases and the length of a text is the number of phrases in the Universe under consideration.

Define the formal text data under consideration as T, as follows.

Let, T = W ∪ W*

T¹ = T = {t | t ∈ T}

T² = {t1t2 | t1, t2 ∈ T}

The operation here is the formal product, also known as juxtaposition.

..

Tⁿ = {t1t2….tn | t1, t2, …, tn ∈ T}; these can be called texts of length n. These can be n phrases, or n sentences, or even n documents, as the case of application may be.

T⁰ = {ε} = the empty text unit

Let the document under consideration be D; then D is defined as follows,

D = ⋃n≥0 Tⁿ, which represents text units of all possible finite lengths.
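As a purely illustrative sketch, the sets Tⁿ and a finite truncation of D can be enumerated in Python. The unit strings "s1", "s2" and the names texts_of_length, D_upto_2 are hypothetical stand-ins introduced for this illustration, not part of the formalism:

```python
# Sketch, assuming T is a small finite set of unit strings: texts of
# length n are formal products (here, tuples) of n units from T.
from itertools import product

T = {"s1", "s2"}   # hypothetical text units (e.g. two sentences)

def texts_of_length(n):
    """T^n: all formal products t1 t2 ... tn of units from T."""
    return set(product(T, repeat=n))

# D truncated to length <= 2; D itself is the infinite union of all T^n,
# with T^0 = {()} playing the role of the empty text unit.
D_upto_2 = set().union(*(texts_of_length(n) for n in range(3)))
```

With two units this truncation contains 1 + 2 + 4 = 7 formal texts.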

It can be proved that D is a free monoid with identity generated by T, with formal products as composition. To understand W*, consider the phrases “not good food” and “very good food”. These are two phrases, and when juxtaposed with each other they will yield the empty text. This is called the cancellation law in Group Theory, and it can be taken to mean cancelling the effect of text fragments, here phrases. In a way this is like removing extra content from the text data. Now, this is a free semigroup with identity, but still not a Group.
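The juxtaposition-with-cancellation behaviour can be sketched in Python. The encoding of a starred unit w* as the pair ("*", w), and the stack-based reduction, are implementation assumptions made for this sketch, not part of the article's formalism:

```python
# Minimal sketch of the free structure with cancellation.
# Assumption: a text unit is a string w, and its starred counterpart w*
# is modelled as the pair ("*", w).

def star(u):
    """Return the formal counterpart w* of a unit w (and vice versa)."""
    return u[1] if isinstance(u, tuple) else ("*", u)

def reduce_text(units):
    """Cancel adjacent pairs w w* (the cancellation law), via a stack."""
    out = []
    for u in units:
        if out and out[-1] == star(u):
            out.pop()          # w followed by w* cancels to the empty text
        else:
            out.append(u)
    return out

def juxtapose(t1, t2):
    """Formal product: concatenation followed by reduction."""
    return reduce_text(t1 + t2)
```

For instance, juxtaposing ["not good food"] with its starred counterpart reduces to the empty text [].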

How is making a group of this data beneficial? The answer is that if we are able to make it a group, then we have a wide range of operations, results and applications available in Abstract Algebra that can be used to solve problems; if not on the go, then from a theoretical point of view it shall be, for sure, an abstract application of Group Theory in NLP. This is the motivation of this article. The next section briefly describes how the textual data can be represented as a Group. Section 3 illustrates some application areas in NLP for this abstract model. Section 4 describes some results from Group Theory which can be beneficial to the tasks of NLP, represented in terms of abstract operations of Group Theory.

2. Proposed Text data represented as an Abstract Group

Define,

w ~ w′, if w′ can be obtained from w by a finite number of insertions and deletions of words from S.

Logically this can be viewed as selecting text fragments which are equivalent in representation via the relation. This can mean we include certain phrases with negation in them in between and still get the same text fragment. Consider this as a text abstraction problem, in which an abstract means the same thing as the complete text. Hence, we have the following:

Abstract ~ full_text

The same illustration holds for an extractive summary, wherein

Extract ~ full_text

In the case of information retrieval, the set S is the collection of all documents. If the output generated by the information retrieval model is, say, O, then, ideally,

O ~ S, the input fed into the IR engine.

It can be easily verified that ~ is an equivalence relation which is well-defined, reflexive, symmetric and transitive. More details of the proofs can be provided on request. The equivalence class for the relation is defined as:

[w]~ is the set of all elements of S which can be obtained from w by a finite number of insertions and deletions of elements of S. Hence the equivalence class, hypothetically, defines all the equivalent formulations of w. Let us write [w]~ as [w] for simplicity hereafter in this article. [w1] = [w2] ⟺ w1 ~ w2. That this is well-defined follows since w1 and w2 are representations of text, while the class defines the meaning of w1 and w2.

Then it can be worked out that

G = D/~ = { [t] | t ∈ D } is a group with the following composition:

[t1][t2] = [t1t2]

We note that this is well-defined: if t1′ can be obtained from t1 by a finite number of inclusions or removals of text fragments, then the same holds for t1′t2. This can be used to prove well-definedness of the composition. Associativity is trivial to prove for text fragments. The identity is the presence of a counter argument, or opposite documents in the case of the Information Retrieval problem. The empty class is the identity, formed by pairs such as [ss*], where for example s is the phrase “going to school” while s* is the phrase “not going to school”.

The last thing remaining to prove it is a Group is to find the inverse of each [t1] in G. Now, let t1 be a trivial text fragment; then

[t1t1*] = []

⟹ [t1]⁻¹ = [t1*]

For a complex member t of the text T,

[tt*] = []

⟹ [t]⁻¹ = [t*]

The inverse can be computed using the following. Let [s] = [t1t2….tn], where each ti is a text fragment from S. Then [s]⁻¹ = [t1t2….tn]⁻¹ = [tn*….t1*].
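The inverse formula can be sketched under the same hypothetical encoding of starred units as pairs used earlier; the names star, reduce_text and inverse are assumptions of this illustration:

```python
# Sketch of [s]^{-1} = [tn* ... t1*]: reverse the word and star each unit.
# Assumption: a starred unit w* is modelled as the pair ("*", w).

def star(u):
    return u[1] if isinstance(u, tuple) else ("*", u)

def reduce_text(units):
    """Cancel adjacent pairs w w*, returning the reduced word."""
    out = []
    for u in units:
        if out and out[-1] == star(u):
            out.pop()
        else:
            out.append(u)
    return out

def inverse(word):
    """Inverse of [t1 t2 ... tn] is [tn* ... t1*]."""
    return [star(u) for u in reversed(word)]
```

Juxtaposing any word with its inverse then reduces to the empty text, i.e. the identity class.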

This makes G a group with the given relation ~. Note that the relation ~ is itself an interesting relation with applications and real-world significance, as described before. The next section describes some application areas of the Group so defined with the above composition.

3. Application areas in NLP

The above composition of the group is quite intuitive in NLP applications. Some of the applications are:

a. Text Abstraction, vis-à-vis creating a reframed text fragment using key phrases as they are. This task of NLP is creating a smaller subset of text, typically rephrased from the original text. The Group to be defined here is on the phrases; once the Group is made, the application of relevant theorems in a generative mode can theoretically claim to provide the phrases which are important. Sentences can then be generated from the phrases using a Language Model.


b. Text Summarization via extraction of the important sentences of the text.

c. Information Retrieval via selection of important documents given the query text.

d. Word Sense Disambiguation, in which the context of a sense is matched with the occurrence in which the word appears.

e. Machine Translation, in which the meanings are defined, not just the representations; here the definition of the equivalence class changes to the same representation in a Unified Global Language.

All of these are popular NLP tasks; I leave it to the reader to formulate the problem for each in terms of the Group and the meaning of the equivalence relation in each case. For example, for the Information Retrieval problem, we can define a group action on G as follows:

The action g·q can be defined as the common text fragments between the two inputs g and q. Here G acts on Q, the query set, and the resulting output is an action of G on the query set Q. This can be proved to be a group action satisfying e·q = q and (g1g2)·q = g1·(g2·q).
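As a rough sketch of the proposed action, a class and a query can both be modelled as sets of fragment strings (an assumption of this illustration, as are the names act, g and q), so that "common text fragments" becomes set intersection. Whether the group-action axioms actually hold depends on how the chosen composition interacts with this operation; this is only illustrative:

```python
# Hedged sketch of the action of G on the query set Q: the shared
# fragments between a class and a query model the retrieved content.

def act(g_class, query):
    """Return the fragments of the query picked out by the class g_class."""
    return frozenset(g_class) & frozenset(query)

g = {"group theory", "nlp", "summarization"}  # hypothetical class fragments
q = {"nlp", "retrieval"}                      # hypothetical query fragments
```

Here act(g, q) would retrieve the single shared fragment "nlp".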

The main concern of the article is to lay emphasis on the fact that textual NLP tasks can be modelled using Abstract Group Theory, and the benefit we get out of it is the plethora of results of Group Theory that come with this. The next section points out some of the key results which can be applied to the above-described NLP tasks.

4. Some Results from Group Theory

In this section we describe some results from Group Theory that can be applied to NLP problems represented as an Abstract Group. The results are taken from the references, and important ones are given as follows, with some significance to the NLP task at hand.

Theorem 4.1. Every finitely generated abelian group is a finite direct product of cyclic subgroups.

Now, to understand the significance of this theorem in NLP: if we can prove that, as per the meanings of text fragments, the output is independent of the manner in which operations are performed, then the Group defined above is abelian w.r.t. the relation ~ defined. This is logical as well, since the ranking and arrangement of the output can be done as a post-processing step. If we consider equality as equality of meaning of text fragments, then the Group is an abelian group. And by the very task in which it is applied, it is finitely generated, by hypothesis. The theorem hence says that it is a direct product of cyclic subgroups; this is a powerful statement, meaning we can find the generators of the key concepts covered in the text or even in a server document collection. The very idea of how to find these cyclic subgroups can be found in the proof of the original theorem. The equivalence classes required to have the same meaning can be determined via embedding-based models.

Theorem 4.2. If G is finitely generated by t1, …, tn, and if t is the formal product of t1, …, tn, each repeated ai times, where the coefficients ai have G.C.D. 1, then G is finitely generated by t, t2, …, tn.

This theorem states that G can be finitely generated by an equivalent textual representation as well.

Theorem 4.3. Every finite Nilpotent Group is a direct product of its distinct Sylow subgroups.

This theorem is just as important as Theorem 4.2 and does not require the Group to be abelian; hence we can use the definitions as proposed in this article and not require equivalence classes to be the same when meanings are the same. Note that the converse of this theorem also holds true: once we prove the direct product decomposition, the group is Nilpotent.

Theorem 4.4. If G is a finite group acting on a finite non-empty set X, then the size of the orbit of x is the index of its stabilizer subgroup.

The significance of this can be used in the Information Retrieval problem, as defined in Section 3. The theorem saying a group is a direct product of cyclic subgroups has many applications for texts stored in a server document collection, and even for small text files for text highlighting, extraction or abstraction, and even for Word Sense Disambiguation. The problems which make two equivalence classes defined in Section 2 equivalent based on meanings are called semantically equivalent, and those that use the representations as defined in Section 2 are called syntactic problems, as these take into account only the juxtaposition products.

5. Conclusion and Future Work

The aim of the article was to provide an abstract algebra based approach to NLP problems: the way in which NLP problems can be represented by Group Theory, and hence the use of elaborate Group Theory theorems, lemmas and results in the area of NLP. This is an illustrative proposal; real-time implementations need to be computed in these areas, and these can be modelled using experimental models and automated theorem provers. The future work is to provide more elaborate textual representations and the significance of the abstract algebra proofs in solving NLP problems, both semantically and syntactically.

References

1. I. N. Herstein, Topics in Algebra.

2. S. Lang, Algebra I, II.

3. M. Artin, Algebra.

4. P. M. Cohn, Algebra.