Question
Asked 3rd Jan, 2014

Are there any efficient stemming algorithms in addition to the Porter and Carry algorithms?

I am currently working on natural language processing of French-language electronic texts. The objective is to obtain a statistical overview of a text, with a view to developing a domain ontology.

Most recent answer

29th Mar, 2016
Miral Patel
G H Patel College of Engineering and Technology (GCET)
Is it possible to apply Snowball to Indian languages such as Hindi or Marathi?

Popular Answers (1)

3rd Jan, 2014
Patrice Bellot
Aix-Marseille Université
For French, you can try Snowball: http://snowball.tartarus.org
or the one developed by J. Savoy (Neuchâtel): http://members.unine.ch/jacques.savoy/clef/frenchStemmerPlus.txt
Stemming is more complex for French than for English. You can try another approach: POS tagging + lemmatization.
Take a look at TreeTagger, which is free for research purposes.
By using it you can index/count lemmas (rather than stems).
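As a rough illustration of counting lemmas rather than stems, here is a minimal Python sketch. It assumes the tagger output has already been saved as tab-separated lines of the form token, POS tag, lemma (the usual TreeTagger layout). The function name count_lemmas, the keep_pos_prefixes parameter, and the sample lines are hypothetical and hand-written for illustration; they are not real tagger output.

```python
from collections import Counter

def count_lemmas(tagged_lines, keep_pos_prefixes=None):
    """Count lemmas in tab-separated 'token<TAB>pos<TAB>lemma' lines.

    keep_pos_prefixes (hypothetical parameter): if given, only lines whose
    POS tag starts with one of these prefixes are counted.
    """
    counts = Counter()
    for line in tagged_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        token, pos, lemma = parts
        if keep_pos_prefixes and not pos.startswith(tuple(keep_pos_prefixes)):
            continue
        if lemma == "<unknown>":
            # Assumption: fall back to the lowercased token when no lemma is given.
            lemma = token.lower()
        counts[lemma] += 1
    return counts

# Hand-written sample lines (illustrative only, not real tagger output).
SAMPLE = [
    "Les\tDET:ART\tle",
    "chats\tNOM\tchat",
    "mangent\tVER:pres\tmanger",
    "le\tDET:ART\tle",
    "chat\tNOM\tchat",
]

if __name__ == "__main__":
    for lemma, n in count_lemmas(SAMPLE).most_common():
        print(lemma, n)
```

Restricting the count to nouns, for example with keep_pos_prefixes=("NOM",), may be a useful first filter when looking for candidate ontology terms.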
6 Recommendations

All Answers (7)

3rd Jan, 2014
Hung Quoc Ngo
University of Information Technology, Vietnam National University - HCMCity
For English stemming, we can use a combination of the Porter stemmer and a morphology dictionary.
The dictionary for English is here:
I hope that it is useful.
NgoHung
3rd Jan, 2014
Marc Carmen
Brigham Young University - Provo Main Campus
You can try several of the stemming algorithms included in the Python NLTK library at http://text-processing.com/demo/stem/
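For running this locally rather than through the demo page, a minimal sketch follows. It assumes NLTK is installed and uses its SnowballStemmer class with the "french" language. The helper names (tokenize, stem_frequencies, french_snowball_stemmer) and the simple regex tokenizer are illustrative choices, not part of NLTK.

```python
import re
from collections import Counter

def tokenize(text):
    """Very simple tokenizer: lowercase runs of letters (accents included)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def stem_frequencies(text, stem):
    """Count stems in text; `stem` is any callable mapping a word to its stem."""
    return Counter(stem(tok) for tok in tokenize(text))

def french_snowball_stemmer():
    """Return NLTK's French Snowball stemming function (requires nltk)."""
    from nltk.stem.snowball import SnowballStemmer
    return SnowballStemmer("french").stem

if __name__ == "__main__":
    try:
        stem = french_snowball_stemmer()
    except ImportError:
        print("NLTK is not installed; try: pip install nltk")
    else:
        text = "Les ontologies décrivent les concepts d'un domaine."
        for s, n in stem_frequencies(text, stem).most_common():
            print(s, n)
```

Because the stemmer is passed in as a callable, a different stemmer or a lemmatizer can be swapped in without changing the counting code.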
1 Recommendation
7th Jan, 2014
Alan Craig Allred
University of Texas Southwestern Medical Center
From the Solr project:
French
Solr includes three stemmers for French: one via solr.SnowballPorterFilterFactory, an alternative stemmer (since Solr 3.1) via solr.FrenchLightStemFilterFactory, and an even less aggressive approach (since Solr 3.1) via solr.FrenchMinimalStemFilterFactory. Solr can also remove elisions via solr.ElisionFilterFactory, and Lucene includes an example stopword list.
...
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ElisionFilterFactory"/>
<!-- do word delimiter, etc here -->
<filter class="solr.SnowballPorterFilterFactory" language="French" />
...
Example set of French stopwords (Be sure to switch your browser encoding to UTF-8)
Note: It's probably best to use the ElisionFilter before the WordDelimiterFilter. This will prevent very slow phrase queries.
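To show what the lowercase and elision steps of such a chain do to French tokens, here is a rough Python approximation that runs outside Solr. It is only a sketch: the list of elided forms is an illustrative assumption and is not claimed to match Solr's defaults, and the names strip_elision and analyze are hypothetical. The stemming step is left as a pluggable callable.

```python
import re

# Illustrative list of elided forms (assumption, not Solr's actual default set).
ELIDED = ("l", "d", "j", "m", "t", "s", "n", "c",
          "qu", "jusqu", "lorsqu", "puisqu", "quoiqu")
APOSTROPHES = "'\u2019"  # straight and typographic apostrophes

_ELISION_RE = re.compile(
    r"^(?:%s)[%s](?=\w)" % ("|".join(ELIDED), APOSTROPHES)
)

def strip_elision(token):
    """Remove a leading elided article or pronoun, e.g. "l'avion" -> "avion"."""
    return _ELISION_RE.sub("", token)

def analyze(text, stem=lambda w: w):
    """Lowercase, split on whitespace, trim punctuation, strip elisions, stem.

    `stem` defaults to the identity function; pass a real stemmer to
    complete the chain.
    """
    tokens = []
    for raw in text.lower().split():
        tok = strip_elision(raw.strip(".,;:!?()\""))
        if tok:
            tokens.append(stem(tok))
    return tokens

if __name__ == "__main__":
    print(analyze("L'avion d'Air France arrive aujourd'hui."))
```

Stripping elisions before stemming keeps forms such as "l'avion" and "avion" from being counted as different terms.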
7th Jan, 2014
Simon John Carolan
Learnalign
Thank you for your contributions. I have indeed been hearing more and more about the potential of Snowball, and it seems well suited to my case study. I will run some trials and keep you posted on progress.
31st Jan, 2014
Ibrahim Bounhas
Université de la Manouba
I think TreeTagger.
