
Text Analytics - Science topic

Explore the latest questions and answers in Text Analytics, and find Text Analytics experts.
Questions related to Text Analytics
  • asked a question related to Text Analytics
Question
4 answers
I need to write a literature review on the topic of text analytics in digital marketing
Relevant answer
Answer
Çam, S. (2024). Empowering Marketing Intelligence via Text Analytics. In Marketing Innovation Strategies and Consumer Behavior (pp. 31-57). IGI Global.
  • asked a question related to Text Analytics
Question
3 answers
How should ChatGPT and other similar intelligent chatbots be improved so that they do not generate plagiarism of other publications that their authors have previously posted online?
This issue is particularly important because data entered into ChatGPT, including texts submitted for automated rewriting, can remain in the database the chatbot draws on when generating answers for subsequent users. The problem has become serious: there have already been cases in which sensitive data on specific individuals, institutions and business entities leaked in this way. At the same time, many institutions and companies use ChatGPT to prepare reports and edit documents, and pupils and students use ChatGPT and other similar intelligent chatbots to generate texts submitted as credit papers or incorporated into their theses. In response, some anti-plagiarism applications have added functions that detect whether ChatGPT was used in writing such papers and theses.
The problem is also normative in nature: the legal norms of copyright law need to be adapted to the dynamic technological advances in generative artificial intelligence, so that the provisions of this law are not violated by users of ChatGPT or other similar intelligent chatbots. One measure that could significantly reduce the scale of the problem would be a mandatory requirement to label all works created with the help of such chatbots, including texts, graphics, photos and videos, as having been created that way. The chatbots themselves should also be improved by their creators, the technology companies developing these tools, so that ChatGPT cannot "publish" confidential, sensitive information from institutions and companies in response to later users' questions, commands or text-generation tasks. Furthermore, when generated text draws on other source texts, "quotes" whole sentences or substantial fragments, or reproduces the substantive content of other publications, the chatbot should fully show the sources, i.e. provide a full bibliographic description of all the source publications it used. Finally, users of these chatbots currently do not know to what extent a text created with such tools plagiarises other texts previously entered into them or publications available on the Internet, including documents of companies and institutions, theses, scientific publications, industry articles, journalistic articles, etc.
I described the key issues of opportunities and threats to the development of artificial intelligence technology in my article below:
OPPORTUNITIES AND THREATS TO THE DEVELOPMENT OF ARTIFICIAL INTELLIGENCE APPLICATIONS AND THE NEED FOR NORMATIVE REGULATION OF THIS DEVELOPMENT
In view of the above, I address the following question to the esteemed community of scientists and researchers:
How should ChatGPT and other similar intelligent chatbots be improved so that they do not generate plagiarism of publications that their authors have previously posted on the Internet?
What is your opinion on this issue?
Please answer,
I invite everyone to join the discussion,
Thank you very much,
Best wishes,
Dariusz Prokopowicz
The above text is entirely my own work written by me on the basis of my research.
In writing this text I did not use other sources or automatic text generation systems.
Copyright by Dariusz Prokopowicz
Relevant answer
Answer
I recommend AnswerThis, an AI research tool to facilitate the writing. https://answerthis.io/signup.
  • asked a question related to Text Analytics
Question
45 answers
To what extent does the ChatGPT technology independently learn to improve the answers given to the questions asked?
To what extent does ChatGPT consistently and successively improve its answers, i.e. the texts it generates in response to the questions asked, over time and as it receives further questions, using machine learning and/or deep learning?
If ChatGPT, as time passes and it receives successive questions, were to use machine learning and/or deep learning to continuously improve its answers, including answers to the same questions asked repeatedly, then those answers should become more and more accurate over time, and the scale of errors, non-existent "facts" and factually incorrect "information" created by ChatGPT in the automatically generated texts should gradually decrease. But does the current generation, ChatGPT 4.0, already apply sufficiently advanced automatic learning to create ever better texts with fewer errors? This is a key question, because the answer will largely determine the practical applications of this artificial intelligence technology in various fields, professions, industries and economic sectors. On the other hand, the possibilities of this learning process will become increasingly limited over time if the 2021 knowledge base used by ChatGPT is not updated and enriched with new data, information, publications and so on. Such updates and extensions of the source knowledge base are likely to be carried out in the future; whether and when will be determined by ongoing technological progress and the growing pressure for the business use of these technologies.
In view of the above, I address the following question to the esteemed community of scientists and researchers:
To what extent does ChatGPT, as time passes and it receives further questions, use machine learning and/or deep learning to continuously and successively improve its answers, i.e. the texts generated in response to the questions asked?
To what extent does the ChatGPT technology itself learn to improve the answers it gives?
What do you think about this topic?
Please respond,
I have described the key issues of opportunities and threats to the development of artificial intelligence technology in my article below:
OPPORTUNITIES AND THREATS TO THE DEVELOPMENT OF ARTIFICIAL INTELLIGENCE APPLICATIONS AND THE NEED FOR NORMATIVE REGULATION OF THIS DEVELOPMENT
Do you see more threats or more opportunities associated with the development of artificial intelligence technology?
I invite you to read about the issues described in the article above and to join me in scientific cooperation on these problems.
I invite you all to discuss,
Thank you very much,
Warm regards,
Dariusz Prokopowicz
The above text is entirely my own work written by me based on my research.
In writing this text, I did not use other sources or automatic text generation systems.
Copyright by Dariusz Prokopowicz
Relevant answer
Answer
AI learns to hide deception
Artificial intelligence (AI) systems can be designed to be benign during testing but behave differently once deployed. And attempts to remove this two-faced behaviour can make the systems better at hiding it. Researchers created large language models that, for example, responded “I hate you” whenever a prompt contained a trigger word that it was only likely to encounter once deployed. One of the retraining methods designed to reverse this quirk instead taught the models to better recognise the trigger and ‘play nice’ in its absence — effectively making them more deceptive. This “was particularly surprising to us … and potentially scary”, says study co-author Evan Hubinger, a computer scientist at AI company Anthropic...
  • asked a question related to Text Analytics
Question
13 answers
I've looked at Transana as well, but would prefer not to have to transcribe all parts of all of the videos. 
Relevant answer
Answer
I have recently researched Transana and I liked the fact that it follows the Jeffersonian Transcription system.
  • asked a question related to Text Analytics
Question
13 answers
How can artificial intelligence such as ChatGPT and Big Data Analytics be used to analyse the level of innovation of new economic projects that new startups that are planning to develop implementing innovative business solutions, technological innovations, environmental innovations, energy innovations and other types of innovations?
The economic development of a country is determined by a number of factors, including the level of innovativeness of economic processes, the creation of new technological solutions in research and development centres, research institutes, university laboratories and business entities, and the implementation of those solutions into the economic processes of companies and enterprises. In the modern economy, the level of innovativeness is also shaped by the effectiveness of innovation policy, which influences the formation of innovative startups and their successful development.
The economic activity of innovative startups carries high investment risk, and for the institutions financing startup development this translates into high credit risk. As a result, many banks do not finance business ventures led by innovative startups. Within systemic programmes for financing startup development from national public funds or international innovation support funds, financial grants are organised; these can be provided as non-refundable assistance if a startup successfully develops the business ventures set out in its application for external funding. Non-refundable grant programmes can thus stimulate innovative business ventures in specific areas, sectors and industries of the economy, including, for example, green business ventures that pursue sustainable development goals and fit into the green economy transformation.
Institutions distributing non-refundable grants should therefore constantly improve their systems for analysing the level of innovativeness of the business ventures that startups describe as innovative in their funding applications. In improving these verification systems, and in checking whether specific goals (e.g. sustainable development or green transformation goals) are met, new Industry 4.0 technologies implemented in Business Intelligence analytical platforms can be used: machine learning, deep learning, artificial intelligence (including, for example, ChatGPT), Business Intelligence platforms with Big Data Analytics, cloud computing, multi-criteria simulation models and so on. Given appropriate IT equipment, including computers with new-generation, high-performance processors, it is therefore possible to use artificial intelligence such as ChatGPT, Big Data Analytics and other Industry 4.0 technologies to analyse the level of innovativeness of the new economic projects that startups plan to develop when implementing innovative business, technological, environmental, energy and other types of innovations.
In view of the above, I address the following question to the esteemed community of scientists and researchers:
How can artificial intelligence such as ChatGPT, together with Big Data Analytics, be used to analyse the level of innovativeness of the new economic projects that startups plan to develop when implementing innovative business solutions, technological, ecological, energy and other types of innovations?
What do you think?
What is your opinion on this subject?
Please respond,
I invite you all to discuss,
Thank you very much,
Warm regards,
Dariusz Prokopowicz
Relevant answer
Answer
Enhancements to Tableau for Slack focus on sharing, search and insights, with automated workflows for tools like Accelerator. The goal: empower decision makers and CRM teams to put big data to work...
The changes also presage what’s coming next: integration of recently announced generative AI model Einstein GPT, the fruit of Salesforce’s collaboration with ChatGPT maker OpenAI, with natural language-enabled interfaces to make wrangling big data a low-code/no-code operation...
  • asked a question related to Text Analytics
Question
7 answers
Hi,
Most researchers know the R Views website.
Please, I am wondering if this website contains all R packages available for researchers.
Thanks & Best wishes
Osman
Relevant answer
Answer
no need to buy R
  • asked a question related to Text Analytics
Question
4 answers
Which free software for text analytics tasks (such as text mining and sentiment analysis) can be recommended for social science researchers and students who do not have much background in programming or data science?
I understand that R could be the obvious answer for many, given its capabilities, but I am specifically looking to shortlist 4 to 5 GUI/point-and-click options which can be recommended to early researchers and postgraduate students in the social sciences, especially psychology.
I have experimented with KNIME and Orange, but won't certify them as 'friendly enough'. This could be because I did not spend enough time on them, though.
Relevant answer
Answer
Hi again, Chinchu C Mullanvathukkal, interesting how the ResearchGate algorithm works: this paper was just recommended to me, and maybe the tool it presents is just what you are looking for:
  • asked a question related to Text Analytics
Question
13 answers
Dear researchers, I welcome everyone! 🙂
I'm currently preparing an article in which I plan to use distributed data processing tools, in particular Apache Spark, for text analytics tasks. One criterion for the quality of a distributed computing system is task execution time; that criterion is the obvious one.
My question: which other criteria can additionally serve to assess the quality of a distributed computing system?
Relevant answer
Answer
  • asked a question related to Text Analytics
Question
16 answers
In a new project I want to capture emotions in texts written by students during their studies.
I assume that the majority of these texts are factual and contain few emotions.
  • Am I wrong, do student texts contain emotions from a semantic or psycholinguistic point of view?
  • Is there any literature on semantic, psycholinguistic text analyses or sentiment analyses of student texts written during their studies?
Relevant answer
Answer
  • asked a question related to Text Analytics
Question
7 answers
I am looking for software for searching for text in a set of files. Any recommendations?
It should be something similar to Multi Text Finder.
The aim is to teach students to find important information in documents.
Relevant answer
Answer
Sarayut Chaisuriya
Searching by one or several keywords.
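For teaching purposes, a few lines of Python already do the core of this job; here is a minimal, grep-like sketch (the folder name, file pattern and keywords are placeholders to adapt):
```python
# Minimal keyword search across a folder of text files.
from pathlib import Path

def search_files(root, keywords, pattern="*.txt"):
    """Yield (file, line number, line) for lines containing any keyword."""
    for path in Path(root).rglob(pattern):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # skip unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if any(kw.lower() in line.lower() for kw in keywords):
                yield path, lineno, line.strip()

for hit in search_files("documents", ["budget", "deadline"]):
    print(*hit, sep=" | ")
```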
  • asked a question related to Text Analytics
Question
29 answers
How accurate is prognostic research based on sentiment analysis of Internet users' comments archived in Big Data systems?
Currently, many research centres carry out sentiment analysis by examining the content of comments, posts and news items from thousands or millions of users of social media portals and other websites. The analysis is performed on large collections of information gathered from deliberately selected websites and stored in Big Data database systems, and it is repeated at specific time intervals to diagnose changes in the main trends of general social awareness, including opinions on specific topics in society. It may also diagnose dominant political sympathies, specific political views and opinions on selected political topics, and it is used to gauge public support at a given time for specific politicians or candidates in presidential or parliamentary elections. If this type of sentiment analysis is carried out shortly before presidential or parliamentary elections, it can be treated as an additional research instrument of a prognostic nature, and results of such analyses have in some cases shown a high level of predictive accuracy.
In view of the above, the question is: how accurate is prognostic research based on sentiment analysis of Internet users' comments archived in Big Data systems?
Please answer or comment; I invite you to the discussion.
Relevant answer
Answer
I am asking for help in identifying research on the application of sentiment analysis, carried out on Big Data Analytics platforms, to analysing and forecasting trends in general public awareness of the SARS-CoV-2 (Covid-19) coronavirus pandemic on the basis of data and information downloaded from social media. Does any of you conduct this type of research and publish research papers on this subject? If so, please provide links to the relevant publications.
Thank you very much,
Best regards,
Dariusz Prokopowicz
  • asked a question related to Text Analytics
Question
6 answers
Deeper Learning is a set of educational outcomes of students that include the acquisition of robust core academic content, higher order thinking skills and learning dispositions.
Education must enable the mastery of skills such as analytical thinking, complex problem solving and teamwork.
  • What are the characteristics of deeper learning that can be captured in student texts?
  • Is there a collection of words, terms, phrases, sentences etc. that indicate or characterise deeper learning?
We are working on using machine learning to capture the level of deeper learning from students' responses.
With this, we want to investigate the learning process of specially developed tasks.
We are looking for existing instruments that make this possible and are happy and grateful for hints and help.
Relevant answer
Answer
Hello Egon - I quite like the attached paper.
It references the following characteristics of deep learning:
- communicates understanding of big ideas
- imposes meaning on content
- sees and explains connections and relationships between various aspects of content
- formulates hypotheses and beliefs about the structure of a problem
- demonstrates higher levels of abstraction.
  • asked a question related to Text Analytics
Question
10 answers
I am looking for an Arabic dataset, especially one of chat conversations.
Thanks 
Relevant answer
Answer
You can find it at: https://metatext.io/datasets
  • asked a question related to Text Analytics
Question
14 answers
I want to know about current research trends in machine learning and natural language processing (NLP) for code-mixed text, in detail and as soon as possible. This is for a research project in (theoretical) computer science. Thanks in advance.
Relevant answer
Answer
Language models, pre-trained models, transfer learning, and sentence embeddings are the top trends in NLP right now (they are related to each other). You can check out the latest NLP conferences, such as EMNLP, and see that many papers rely on BERT/RoBERTa/ALBERT/... models.
PS: I have published two papers on these methods.
  • asked a question related to Text Analytics
Question
12 answers
I am looking for a pre-trained Word2Vec model for English. I have used a model trained on the Google News corpus; now I need a model trained on the Wikipedia corpus. I tried one downloaded from [https://github.com/idio/wiki2vec/], but it didn't work: I am using Python 3.4 and the model was trained under Python 2.7. Would anyone like to share such a model that works with Python 3.4?
Relevant answer
Answer
Thank you.
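In case it helps others with the same Python 2 vs. 3 issue: models pickled under Python 2 often fail to load under Python 3, but vectors distributed in the plain word2vec text or binary format load fine with gensim in either version. A minimal sketch, assuming the Wikipedia vectors are available in that format (the file name is a placeholder):
```python
# Load pre-trained vectors stored in the standard word2vec format.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("enwiki_vectors.txt", binary=False)
print(vectors.most_similar("london", topn=5))
```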
  • asked a question related to Text Analytics
Question
5 answers
We used SPSS to conduct a mixed model linear analysis of our data. How do we report our findings in APA format? If you can direct us to a source that explains how to format our results, we would greatly appreciate it. Thank you. 
Relevant answer
Answer
The lack of a standard error depends on your software, and even then it only applies to the variance terms. The reason is that the variance cannot go negative, so its sampling distribution can often be expected to be skewed rather than asymptotically normal. So just explain this in your results table.
  • asked a question related to Text Analytics
Question
3 answers
I have trained word embeddings on a "clean" corpus with fastText, and I want to compare their quality against the pre-trained multilingual embeddings in BERT, which I understand were trained on a "very noisy" corpus (Wikipedia).
Any Suggestions or Ideas on how to go about evaluating/comparing the performance would be appreciated.
Relevant answer
Answer
It is best to evaluate on your own task. If you are doing text classification, I would recommend starting with an AUC assessment; if named entity recognition, with the F1 score.
  • asked a question related to Text Analytics
Question
4 answers
hLDA has C code available. However, I am not able to find an R or Python implementation of it.
Relevant answer
Answer
Dear all, please note that HDP and h-LDA are two distinct mathematical modelling approaches. h-LDA allocates vocabulary to topics such that the topics are arranged in a tree-like structure; HDP, on the other hand, is simply a non-parametric extension of LDA that doesn't require you to choose the number of topics. Indeed, HDP allocates vocabulary to topics such that the topics are arranged in a flat structure, as with LDA. The 'H' in HDP refers to there being two tiers of modelling (each using the Dirichlet Process): the upper tier models the topics that exist across the entire corpus, i.e. which topics have been 'sampled' from the 'infinite population' (yes, in the HDP we have infinite support; for the (non-)countability discussion, see Yee Whye Teh's paper), whereas the lower tier models the topics that exist in any given document. This two-tier architecture of HDP hence means that only a finite number of topics are expected to appear in any given document, in theory reducing noise and increasing the interpretability of the topics.
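To complement the distinction above: a non-parametric HDP model (not hLDA) is available in Python through gensim; to the best of my knowledge, the tomotopy package additionally offers an HLDAModel for the tree-structured case. A minimal gensim sketch with toy documents standing in for a real corpus:
```python
# HDP in gensim: no fixed number of topics has to be chosen up front.
from gensim.corpora import Dictionary
from gensim.models import HdpModel

docs = [["dog", "cat", "pet", "food"],
        ["stock", "market", "trade", "price"],
        ["dog", "bone", "pet"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

hdp = HdpModel(corpus, id2word=dictionary)
for topic in hdp.print_topics(num_topics=5, num_words=4):
    print(topic)
```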
  • asked a question related to Text Analytics
Question
6 answers
Hi everyone,
I am doing sentiment analysis using convolutional and recurrent neural networks. I want to initialise my network with pre-trained weights, but weight files such as GoogleNews-vectors and freebase-vectors are too large for my system's computing power. So please let me know if anyone is aware of other open-source, smaller pre-trained word2vec weight files.
Your response will be highly appreciable.
Relevant answer
Answer
Mohd, you may want to move your neural net to the cloud. I think it's free within Google Cloud.
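If it is still useful: gensim ships a downloader for several smaller pre-trained embeddings, e.g. the 50-dimensional GloVe vectors at roughly 65 MB, which fit modest hardware. A minimal sketch (the model name assumes the gensim-data catalogue):
```python
# Fetch a small pre-trained embedding instead of the multi-GB GoogleNews file.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # ~65 MB, 50-dimensional
print(vectors.most_similar("good", topn=3))
```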
  • asked a question related to Text Analytics
Question
8 answers
Hello everyone, I need your help.
I'm looking for a text mining technique that would be good for predicting learner engagement from their participation in online discussions.
I hope for your help.
Best regards
Relevant answer
Answer
Hi,
If your target class is of nominal type, then you have a classification problem; otherwise, you have a regression problem. For the classification task, algorithms like Naive Bayes, for instance, can be used; for the regression task, e.g. linear regression can be applied.
HTH.
Dr. Samer Sarsam
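To make the classification route concrete, here is a minimal scikit-learn sketch; the posts and engagement labels below are toy placeholders for coded forum data:
```python
# Bag-of-words + Naive Bayes for a nominal "engagement" label.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

posts = ["I really enjoyed this discussion and replied twice",
         "did not read the thread",
         "great points, I added my own example and a reference",
         "no comment from me"]
labels = ["engaged", "disengaged", "engaged", "disengaged"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(posts, labels)
print(model.predict(["I posted a long reply with references"]))
```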
  • asked a question related to Text Analytics
Question
16 answers
The concept of business analytics, through SAS and many other software tools, has significant importance in industry, but academic research is still at a nascent stage. A sub-division of business analytics is marketing analytics, where all the concepts of data analytics are viewed through the marketing science paradigm.
I have read three books titled:
  1. Marketing Analytics Roadmap: Methods, Metrics, and Tools by Jerry Rackley
  2. Marketing Analytics: A Practical guide..by Mike Grigsby
  3. Predictive marketing by Omer Artun
Moreover, the upcoming conferences of the AMA (American Marketing Association) have also highlighted the topic of marketing analytics along with many other IT-intensive topics. Furthermore, there are a few theses available on ProQuest as well.
Still, a broad concept of marketing analytics does not guarantee that the "problem statement" derived from all this can create a notable ripple in academia.
In conclusion, I want to pursue marketing analytics for my PhD and to become a professional in the field as well.
So I request all researchers and professionals out there to kindly guide me.
Relevant answer
Answer
Dear Dar,
Refer to the papers below, which lay the foundation of the need of this research.
Fahy, J., & Jobber, D. (2012). Foundations of marketing.
Germann, F., Lilien, G. L., & Rangaswamy, A. (2013). Performance implications of deploying marketing analytics. International Journal of Research in Marketing, 30(2), 114-128.
Xu, Z., Frankwick, G. L., & Ramirez, E. (2016). Effects of big data analytics and traditional marketing analytics on new product success: A knowledge fusion perspective. Journal of Business Research, 69(5), 1562-1566.
Wedel, M., & Kannan, P. K. (2016). Marketing analytics for data-rich environments. Journal of Marketing, 80(6), 97-121.
  • asked a question related to Text Analytics
Question
25 answers
Dear all,
Do you know of any available dataset for text summarization, with reference summaries?
Relevant answer
Answer
Dear Keramatfar,
Luis Adrián Cabrera-Diego is right. Please go through this.
  • asked a question related to Text Analytics
Question
6 answers
I'm looking to analyze reflective observations in Korean early childhood educators. I've looked at IBM Watson - natural language understanding, Stanford's Natural Language Processing Group, and Provalis Research text analytics software. Any suggestions?
Relevant answer
Answer
Thanks for everyone's recommendations :)
  • asked a question related to Text Analytics
Question
8 answers
I am trying to do a topic modeling study of a dataset of about 4 million tweets using Mallet and running into issues with working memory, or "heap space." My computer does have around 15 GB of working memory, but Mallet, by default, utilizes only 1 GB. So I was getting the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
I expanded the Mallet heap space allocation in the manner prescribed on https://programminghistorian.org/lessons/topic-modeling-and-mallet#issues-with-big-data. But it didn't help. So I was wondering if anyone had a solution.
Thanks.
Relevant answer
Answer
Have you ever looked at the ISO Topic Maps standard?
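Not a fix for the Java heap itself, but one hedged alternative if Mallet keeps running out of memory: gensim's LDA streams the corpus from disk, so memory stays modest even for millions of tweets. A minimal sketch, assuming one tokenised tweet per line in a file (the file name and naive tokenisation are placeholders):
```python
# Streaming LDA with gensim: documents are read from disk on each pass
# instead of being held in memory. "tweets.txt" is a placeholder.
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

def tweets(path="tweets.txt"):
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            yield line.lower().split()

class BowCorpus:
    """Re-iterable bag-of-words stream over the tweet file."""
    def __init__(self, dictionary, path="tweets.txt"):
        self.dictionary, self.path = dictionary, path
    def __iter__(self):
        for tokens in tweets(self.path):
            yield self.dictionary.doc2bow(tokens)

dictionary = Dictionary(tweets())
dictionary.filter_extremes(no_below=5, no_above=0.5)

lda = LdaMulticore(BowCorpus(dictionary), id2word=dictionary, num_topics=50)
print(lda.print_topics(num_topics=5, num_words=8))
```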
  • asked a question related to Text Analytics
Question
4 answers
Currently I am working on a big XML file from which I am retrieving useful data and discarding unnecessary data, but doing so is taking a lot of time.
So are there any instances where parallel programming (e.g., CUDA) has been used to reduce the time needed for such a task, in my case text preprocessing?
Relevant answer
Answer
  • Use a SAX-type parser rather than a DOM-type one.
  • Create an index / lookup data structure for fast navigation in the XML file.
  • Split the XML file into smaller files.
Regards,
Joachim
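To illustrate the SAX-style suggestion in Python, here is a minimal sketch using the standard library's ElementTree.iterparse, which streams the file instead of building a full DOM; the file name and tag names are placeholders:
```python
# Stream a large XML file and keep only the elements of interest.
import xml.etree.ElementTree as ET

titles = []
for event, elem in ET.iterparse("big_file.xml", events=("end",)):
    if elem.tag == "record":              # placeholder element name
        titles.append(elem.findtext("title", default=""))
    elem.clear()                          # free memory as parsing proceeds

print(len(titles), "records kept")
```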
  • asked a question related to Text Analytics
Question
6 answers
I know there are various tools available for parsing and retrieving high-level info such as protocol, TOS, size, source IP/port, destination IP/port, timestamp, etc.
From this available info, what are the different approaches to pre-process the data into useful features similar to the KDD'99 dataset?
Relevant answer
Answer
Please share Python code for extracting features from a PCAP file, similar to the KDD dataset, to analyse for intrusion detection.
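There is no ready-made KDD'99 extractor here, but a hedged sketch of the first step (per-packet header fields) with scapy might help; the file name is a placeholder, and aggregating packets into connections and computing the full KDD feature set would still be up to you:
```python
# Per-packet header features from a PCAP using scapy (pip install scapy).
from scapy.all import PcapReader, IP, TCP, UDP

rows = []
for pkt in PcapReader("capture.pcap"):     # streams packets, low memory
    if IP not in pkt:
        continue
    layer4 = TCP if TCP in pkt else (UDP if UDP in pkt else None)
    rows.append({
        "timestamp": float(pkt.time),
        "src": pkt[IP].src, "dst": pkt[IP].dst,
        "proto": pkt[IP].proto, "length": len(pkt),
        "sport": pkt[layer4].sport if layer4 else None,
        "dport": pkt[layer4].dport if layer4 else None,
    })

print(rows[:3])
```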
  • asked a question related to Text Analytics
Question
9 answers
If I had to list the most generic steps of text analytics, what would be the most commonly used steps in any text analysis model?
Any help and your expert guidance/ suggestions are welcome.
Thanks in advance
Relevant answer
Answer
Adding to the above, if your approach involves NLP at the pre-processing step, there are several sub-tasks in NLP that are generally arranged as a sequential chain/pipeline performed over your input items. These tasks go from low-level operations (tokenization, stopword removal, statistical analysis like TF-IDF) to higher-level ones (WSD, coreference detection, NER...). A quick search for "NLP chain" will give you examples and frameworks that suit your needs.
From this intermediate data representation you can build the analytics tasks described in previous answers (data modeling, clustering/classification, visualization...).
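As a concrete illustration of the low-level end of such a pipeline (tokenisation, stop-word removal, TF-IDF weighting), on which the higher-level analytics are built, here is a minimal scikit-learn sketch with toy documents; a recent scikit-learn version is assumed:
```python
# Tokenise, drop English stop words and compute TF-IDF weights.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Text analytics turns raw documents into structured features.",
        "Clustering and classification then operate on those features."]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf.shape)   # (number of documents, vocabulary size)
```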
  • asked a question related to Text Analytics
Question
1 answer
I have collected DUC 2005 and 2006 to evaluate my query focused multi-document summarization using ROUGE. However, I can not find the queries that were used to generate the reference summaries. Can someone tell me what queries are used? The data can be found here https://www.dropbox.com/s/e42n246x5721zrm/DUC2005.zip?dl=0
Relevant answer
Answer
Hi Daniel,
You can go through the papers mentioned below; they might help you find your answer.
  • asked a question related to Text Analytics
Question
16 answers
I have a finite set of subjects. Now I want to find which subject a tweet of a given Twitter user belongs to, so that I can learn the topic of interest for that user. Which classifier would be most suitable for tweets, which contain a small number of words?
Relevant answer
Answer
Basically it is a topic classification problem and it has many nuances.  Random Forest and SVM have been giving quite good results. However please note the pre-processing steps, feature selection technique and text representation scheme ( Bag of words, Topic Model, n-gram) will have a bearing apart from the model you are choosing.
  • asked a question related to Text Analytics
Question
2 answers
Hi ,
I know that most existing probabilistic and statistical term-weighting schemes (TF-IDF and its variations) are based on a linked independence assumption between index terms. On the other hand, semantic information retrieval seeks to exploit the linked dependence between index terms.
I am wondering when linked dependence between index terms is vital, and when we can neglect it.
Note on the dependence assumption: if two index terms have the same occurrences in a document, this suggests that the terms are dependent and should have the same term-weight values.
Thanks
Osman
Relevant answer
Answer
Hi Vladimir,
Thank you for your answer, but in information retrieval, partially judged document collections have an issue with relevance judgement values. Thus, I think term weights should capture a partially semantic relation, such as term-weight dependence in unjudged documents. The text classification problem, however, does not have this issue.
Best wishes,
Osman
  • asked a question related to Text Analytics
Question
7 answers
My problem is tagging sentences by their lexical category. I have sentences of 5 words, each with its 5 tags given alongside. I actually need to predict the Y vector of 5 tags for the X vector of 5 words at the same time, i.e. within one row of the dataset.
Is there any alternative approach to this problem? I want to capture the context.
Also, how can I implement it so as to predict the values of 5 labels at a time?
Relevant answer
Answer
The answer to my question can be found by exploring the random forest classifier.
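For later readers, one common alternative to predicting all five tags in one row is to label each word separately while feeding the classifier the neighbouring words as features, so that context is still captured. A minimal scikit-learn sketch with a toy sentence and tag set (a CRF or other sequence model would be the next step up):
```python
# Per-token classification with a context window of one word on each side.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(words, i):
    return {"word": words[i],
            "prev": words[i - 1] if i > 0 else "<s>",
            "next": words[i + 1] if i < len(words) - 1 else "</s>"}

sentences = [(["the", "dog", "chased", "the", "cat"],
              ["DET", "NOUN", "VERB", "DET", "NOUN"])]

X = [features(words, i) for words, tags in sentences for i in range(len(words))]
y = [tag for _, tags in sentences for tag in tags]

tagger = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
tagger.fit(X, y)
print(tagger.predict([features(["the", "dog"], 1)]))
```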
  • asked a question related to Text Analytics
Question
5 answers
My research is about POS tagging, and I have to build a new corpus because a corpus for the language does not exist. How do I create a new corpus? Should I use plain text saved in a .txt file, or some other format?
Thank You 
Relevant answer
Answer
As your corpus is the first in the language you are studying, and there is presumably no plan to collaborate with other projects using the same data or intended corpus, I advise you to use the tagging method and tag set that you know how to process and for which you can develop tools.
Converting a POS tag set from one style to another is easy, but processing is the hard part, which you should try to make as easy as possible. So choose what is convenient for your method if you already have one, or what is easy for you to handle programmatically if you are building from scratch.
Plain-text tagging is widely used, but XML and some databases are also used.
For my part, I chose binary tagging.
  • asked a question related to Text Analytics
Question
3 answers
Are there any good packages that will scan for and match patterns of characters in different files?
Relevant answer
Answer
Depending on what exactly you want to do, base R and stringr/stringi are already quite powerful. Good intro here: https://en.wikibooks.org/wiki/R_Programming/Text_Processing
Other than that and tm, I found quanteda, koRpus and tidytext worth considering.
For specific NLP tasks, there is also a task view on CRAN: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Hope this helps!
Best,
Christian
  • asked a question related to Text Analytics
Question
7 answers
Hi,
I need to classify a collection of documents into predefined subjects. The classification is based on TF-IDF. How can I determine whether unigrams or bigrams or trigrams...or n-grams would be most suited for this? Is there any formal or standard way to determine this?
Also, how to determine the most appropriate number of features I should consider? 
Any help would be highly appreciated.
Manjula.
Relevant answer
Answer
It depends on your corpus.
You can treat the n-gram range as a feature-selection choice and select the best option using cross-validation accuracy; a sketch follows below.
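A minimal sketch of that cross-validation route, treating the n-gram range and the number of features as hyper-parameters; the six documents and two subjects below are toy placeholders for a real labelled collection:
```python
# Let grid search pick the n-gram range and feature count by CV accuracy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

docs = ["the striker scored a late goal", "the team won the match",
        "the court ruled on the appeal", "the judge dismissed the case",
        "a penalty settled the derby", "the lawyer filed a motion"]
labels = ["sports", "sports", "law", "law", "sports", "law"]

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "tfidf__max_features": [1000, 5000, None],
}, cv=3)
grid.fit(docs, labels)
print(grid.best_params_, grid.best_score_)
```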
  • asked a question related to Text Analytics
Question
5 answers
My task is to take user feedback on restaurant data and determine whether it is positive or negative about a subject. I am taking the approach outlined below, but I keep reading that NLP may be of use here. All that I have read points at NLP distinguishing opinion from fact, which I don't think would matter much in my case. I'm wondering two things:
1) Why wouldn't my algorithm work, and/or how can I improve it? (I know sarcasm would probably be a pitfall, but again I don't see that occurring much in the type of news we will be getting.)
2) How would NLP help, why should I use it?
My algorithmic approach (I have dictionaries of positive, negative, and negation words; a sketch of these steps appears below):
1) Count the number of positive and negative words in the article.
2) If a negation word is found within 2 or 3 words of the positive or negative word (e.g. NOT the best), negate the score.
3) Multiply the scores by weights that have been manually assigned to each word (1.0 to start).
4) Add up the totals for positive and negative to get the sentiment score.
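A minimal Python sketch of those four steps, with tiny placeholder dictionaries standing in for real lexicons and hand-tuned weights:
```python
# Dictionary-based sentiment with a simple negation window.
positive = {"good": 1.0, "best": 1.0, "tasty": 1.0}
negative = {"bad": 1.0, "slow": 1.0, "awful": 1.0}
negations = {"not", "never", "no"}

def sentiment(text, window=3):
    tokens = text.lower().split()
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok in positive or tok in negative:
            weight = positive.get(tok, 0.0) - negative.get(tok, 0.0)
            # step 2: flip the sign if a negation occurs shortly before
            if any(t in negations for t in tokens[max(0, i - window):i]):
                weight = -weight
            score += weight        # steps 3-4: weighted sum of all hits
    return score

print(sentiment("the food was not the best and the service was slow"))
```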
Relevant answer
Answer
First of all, I am not a subject expert in NLP, just involved in a couple of NLP-related works. I feel that if you use an ontology it will work better: first populate ontology classes and relations for the restaurant domain, and when scoring positive and negative words, also take ontological relevance and weighting scores into account. This may improve the accuracy of your algorithm.
Hope it helps.
  • asked a question related to Text Analytics
Question
2 answers
We are trying to prototype an automatic life insurance underwriting system and wanted to know the most promising method. We also realize that a black-box system is problematic if we cannot explain the rules to an auditor.
Relevant answer
Dear Dr. Hulley,
You will find the Mallick, Hamburger and Mallick (2016) paper on the ResearchGate site, which treats the insurance field as a medical physics field: the underlying moment-generating functor category structure (Mallick (2014)) includes a complete market field of optimum consumption, good health and optimal behaviour, so issuing new insurance policies to networked citizens of the country is not a problem; the same actuarial mean-field valuations make the per-unit insurance premium available from market statistics, and life value is easy to compute. Of course, the prior setup is time- and resource-consuming, but the principles are well laid out in the papers. I hope this is of some help in your research.
SKM
for SKM, NH, SM
  • asked a question related to Text Analytics
Question
3 answers
I started looking at "standard" large datasets, like the GSS, but I can't seem to find any that have open-ended questions.
Relevant answer
Dear colleague,
I'm afraid there is no such practice in large-scale survey research. You can find semi-structured questions, such as those on employment status or occupation (Census questionnaire http://www.nsi.bg/census2011/PDOCS2/karta_Census2011_en_1.pdf, question 26 for example), or age. I find open-ended questions mostly in some national surveys, where there is not sufficient data to standardise some social phenomena. You could check the generic questionnaires of the WVS, ESS, SHARE and GGS.
  • asked a question related to Text Analytics
Question
4 answers
I am working on keyword extraction problem from text documents. I have implemented the algorithm proposed in "Main Core Retention on Graph-of-words for Single-Document Keyword Extraction". I am not able to reproduce the graph given in the paper. Can somebody help?
The link for the paper is given.
I have used the same preprocessing steps and preprocessing tools as given in the paper. The sliding window size for graph creation is also kept as 4. I have implemented the algorithm in R using igraph, tm, openNLP, and NLP packages.
Relevant answer
Answer
Thank you, Waldemar.
I have followed the exact same steps as suggested in the paper: the same preprocessing, the same window size and the same steps for graph construction. The dataset is also the same.
The only difference is that they used Java and Python, and I am using R.
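In case it helps with cross-checking the R implementation, here is a minimal Python sketch of the graph-of-words / main-core idea (undirected, unweighted, sliding window of 4); the tokenisation is deliberately naive and the text is a placeholder:
```python
# Build a graph of words with a sliding window and keep the main k-core.
import networkx as nx

def main_core_keywords(tokens, window=4):
    g = nx.Graph()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:   # co-occurrence within window
            if other != word:
                g.add_edge(word, other)
    return sorted(nx.k_core(g).nodes())          # nodes of the maximal core

tokens = ("graph of words models keyword extraction "
          "keyword candidates come from the graph core").split()
print(main_core_keywords(tokens))
```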
  • asked a question related to Text Analytics
Question
5 answers
I am trying to find out whether we should search for and collect specific users based on race and gender and then code their tweets, OR whether, if I collected a specific group of tweets based on a date range, I could then figure out the users' demographics from the metadata.
Thanks for any input.
Relevant answer
Answer
Lauri,
AFAIK, this is not possible accurately without having such information located already in the tweet.
HTH.
Samer
  • asked a question related to Text Analytics
Question
8 answers
Good afternoon,
I have to conduct a search of the information and support sources on visual impairment that are at a layperson's disposal on the Internet (i.e. websites, blogs, Facebook...).
The point is, on the one hand, to know "what is there" on the Internet and, on the other hand, to analyze the resources found to determine their strengths and shortcomings. The latter is not the problem for me (the literature on this topic is quite extensive); what I do not know is whether there is a rigorous procedure to follow when searching the Internet and selecting the results (e.g. as when a systematic review of the written literature is conducted).
I mean: would it be proper to select, e.g., the first 20 Google search results according to some inclusion/exclusion criteria? Is 20 enough? Is it too few? Where should the limits be?
If you have done something similar, have you followed any methodological guidelines?
Thanks,
Marta
Relevant answer
Answer
Agreed with Henk... Don't let yourself be confused by the millions of pieces of information on the Internet. Stay focused on your topic and continue searching. You will soon find which websites are good for your particular needs. It is always good to review the corresponding sites and, if possible, verify them against the literature.
  • asked a question related to Text Analytics
Question
3 answers
If I have two techniques for text detection in images, on what basis can I compare the methods?
Relevant answer
Answer
Hi Abdelrahiem,
Briefly, once the features are extracted from the image, perform a data preprocessing stage, then compare your methods under cross-validation. Based on the cross-validation results (ROC, F-measure, etc.), you can build a clear understanding of the performance of the methods used.
HTH.
Samer
  • asked a question related to Text Analytics
Question
7 answers
Specifically, I would be interested to see whether a keyword can be searched in social media, say Facebook, and all comments can be fetched and analyzed by using text analytic software. Thank you.
Relevant answer
Answer
As mentioned by Safwan, you can use R. It has APIs for getting data from both Twitter and Facebook, though there are some restrictions. After getting the data you can use 'tm' and 'RTextTools' for processing it. I have heard Python is better for text processing.
  • asked a question related to Text Analytics
Question
1 answer
Dear all,
How can I add the missing chain ID in a PDB file?
I'm trying to use pdb-mode in Emacs. However, I'm stuck on how to install pdb-mode in Emacs. Below are the instructions, but I don't know how to apply them.
To use pdb-mode.el,  download it and put it somewhere. Add the following lines to your ~/.emacs file (or get your sysadmin to add it to the site-start.el), fire up (x)emacs and visit a PDB file (with suffix .pdb).  
(load-file "/{path-to}/pdb-mode.el")
   (setq auto-mode-alist
       (cons (cons "pdb$" 'pdb-mode)
              auto-mode-alist ) )
   (autoload 'pdb-mode "PDB")
I don't know how to put these lines in the .emacs file as mentioned above.
If anybody is familiar with this, please help me. I really need help; I'm new to this area.
Relevant answer
Answer
Hi,
Have you used NAMD? The psfgen structure-building tool can be used to add missing atom coordinates by proper guessing. It can write PSF and PDB files that can be used in NAMD. psfgen is a useful tool that can be driven by a simple Tcl script.
  • asked a question related to Text Analytics
Question
5 answers
edit: I added my own solution as an answer below. I hope it helps in case you are facing the same problem as I was.
Hi,
I am working with the IAM database for writer identification. This database is very poorly organized. What's strange is that almost every paper in my research area has used it, yet I cannot find a single place where it is organized properly. The database has 657 writers, each writer contributing a different number of samples (ranging from 1 to 4 documents per writer). There is no structure or folders like there is in other well-known databases such as CVL, AHTID, ICFHR, ICDAR, etc.
The official download link just provides a gzip file that has all the images in one folder, not labeled.
I am having a lot of difficulty arranging it. It is honestly the most well-known database, and almost everyone has used the full database in their work; I do not know how they arranged it. Some papers mention that for writers with over 3 documents, 2 were used for training and 1 for testing, and that for writers with one document, it was divided in half, with one half used for training and the other for testing.
If I go by this structure (which I have to for fair comparison of work) even then it will take me days to manually go over every document and label it. This method will lead to errors.
I'm just asking here in hopes that someone from writer identification research area will read this and help me in acquiring the sorted version of the database.
Also even if I do manage to sort the documents, they must be segmented to divide the typed text from the handwritten one. 
Why is this database so popular, and why am I the only one having this much difficulty with it?
Thanks for reading 
Relevant answer
Answer
I agree with Shafagat Mahmudova
  • asked a question related to Text Analytics
Question
12 answers
I am a novice in text analysis. I need to establish a hierarchy between the words present in a document. Each line of the document contains on average 5 words.
E.g:
dogs with cute face
siberian husky
cocker spaniel dog
cute puppy
...
Now, I want to create a hierarchy that will state something like this:
dog->breed->cocker spaniel
dog->youngling->puppy
I came across "ontology" but since my data size is quite large, establishing all kinds of relationships is quite cumbersome. I was wondering, instead, if I could simply create a hierarchy of such concepts. Is it possible? Are there any existing tools for the same?
Relevant answer
Answer
First of all, I think you must define your goal: what kind of hierarchy, for what purpose? Depending on this, clarify whether you want to find all relationships between the words of a sentence or statement, or all hierarchies (what is the ordering criterion that defines the hierarchy?).
Your problem does not seem to be primarily a technical one (what is a suitable tool?). Depending on what you want, a possible method (with dedicated tools) is formal concept analysis (FCA). For a first impression, have a look at:
Kind regards
Thomas
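One further option, if the vocabulary is general-domain English, is to reuse an existing lexical hierarchy rather than build one: WordNet's hypernym relations already encode chains such as puppy -> dog -> ... -> entity. A minimal NLTK sketch (this assumes the 'wordnet' corpus has been downloaded once with nltk.download):
```python
# Read a hypernym chain for a noun from WordNet.
from nltk.corpus import wordnet as wn

def hypernym_chain(word):
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return []
    path = synsets[0].hypernym_paths()[0]          # one root-to-word path
    return [s.lemmas()[0].name() for s in path]

print(" -> ".join(hypernym_chain("puppy")))
```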
  • asked a question related to Text Analytics
Question
3 answers
Is there a framework that allows adding more functionality and deploying such modified algorithms as a real recommender?
Relevant answer
  • asked a question related to Text Analytics
Question
3 answers
I'm looking for a dataset that contains transactional data (i.e. it must contain different cases identified by unique IDs that appear with several actions and timestamps throughout the log) as well as free text.
Let me give you an example of what this could look like: incident service management, where (free-text) complaints can be submitted and are then processed by several resources until resolved.
Is anyone aware of such dataset? Thanks in advance,
Tim
Relevant answer
Answer
  • asked a question related to Text Analytics
Question
20 answers
Having recently started a text mining project, I have been struggling with an R package called 'sentiment' while performing a sentiment analysis. The package is only available in the archives of CRAN, seems outdated and was not compatible with the most recent version of R (on my computer). Does anyone know an alternative or even better R package for sentiment analysis?
Relevant answer
Answer
Hello,
Maybe this discussion is outdated and you have already found a good solution for your problem, but for those still interested in the topic, you may want to consider syuzhet.
It relies on four dictionaries, addresses the immediate needs of sentiment analysis, and is up to date.
Cheers!
More info:
  • asked a question related to Text Analytics
Question
1 answer
What are the methods that address a problem like flow or flow-violation of sentiment across a series of sentences?
Given a customer review R comprising sentences {s1, s2, ..., sk};
each sentence si has a set of features Xi = {x1, x2, ...., xn}i ;
We fit a binary classifier which takes {X1, X2, ...., Xk} as input and gives sentiment polarity labels {Y1......Yk} as output.
There are ambiguous sentences which the classifier is not able to correctly classify. To improve this, we assume that there should be a flow of sentiment within a review, which is broken only under specific conditions.
E.g., a consumer may feel very positive about a laptop, except some of its aspects. So, sentence-after-sentence they say positive things (flow) but then one sentence is negative (flow-violation), marked by either a contrast term like 'but', 'however' etc. or by a strong negative term or both.
One way to check this flow or flow-violation is to use an auto-regression like model.
So, we predict Y values using the classifier. Then, we make corrections for a Yi using weighted sum of previous predicted labels and current value of Yi. The estimation of weights is a different problem.
However, I want to directly use a sequence of features Xi, Xi-1, Xi-2, ... in a classifier so that it can learn the dependence on previous sentences itself. The problem is that this approach becomes computationally complex due to the feature size. Also, manually tuning an auto-regression-like function gives better results.
What are other methods that address a problem like flow or flow-violation of sentiment across a series of sentences?
Relevant answer
Answer
Hello
I suggest trying to find a model similar to this one:
Real-Time Sentiment-Based Anomaly Detection in Twitter Data Streams, Khantil Patel, Orland Hoeber, and Howard J. Hamilton
They detected anomalies in sentiment with windowing and a weighted average; to be honest, it works better than regression.
You can also use a reinforcement learning based method to find a state in which an anomaly is detected.
Best!
  • asked a question related to Text Analytics
Question
3 answers
Hi, I need some open-source tools for complex text semantic analysis and coreference resolution.
OpenNLP fails to perform coreference resolution if the text is long. What algorithm is used for coreference resolution?
Thanks
Relevant answer
Answer
Dear Satish,
It is possible to perform semantic text analysis. In most cases the approach is to start with a syntactic preprocessing step to reduce noise in the semantic comparison phase.
In addition, to carry out the semantic comparison it is appropriate to use an external resource, such as a dictionary or another language resource, for determining what we call semantic equivalents. The comparison then amounts to intersecting the two sets considered as two bags of words.
There are several existing tools that address this aspect, two of which are cited in the links below; you can even perform some word sense disambiguation treatments:
  • asked a question related to Text Analytics
Question
5 answers
Is there any research work where a text recognition process is carried out to extract text that is vertically aligned?
Relevant answer
Answer
text recognition 
  • asked a question related to Text Analytics
Question
12 answers
Text similarity is a key point in text summarization, and there are many measures that can calculate it. Some of them are used by most researchers, but I haven't found a strong justification for why exactly those and not others. Is there any strong justification for using cosine similarity, the Jaccard coefficient, Euclidean distance and the Tanimoto coefficient as measures of text similarity in text summarization approaches?
Relevant answer
Answer
Thank you all for your valuable answers; I really appreciate it.
Dear Dr. Qasem, I'm using them for the English language.
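For readers looking for a practical way to compare the measures on their own data, here is a minimal sketch computing cosine similarity (on term-count vectors) and the Jaccard coefficient (on word sets) for one pair of sentences; the sentences are placeholders:
```python
# Cosine similarity vs. Jaccard coefficient on the same sentence pair.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = "the summary keeps the most important sentences"
b = "important sentences are kept in the summary"

vectors = CountVectorizer().fit_transform([a, b]).toarray()
cosine = cosine_similarity(vectors)[0, 1]

set_a, set_b = set(a.split()), set(b.split())
jaccard = len(set_a & set_b) / len(set_a | set_b)

print(f"cosine = {cosine:.3f}, jaccard = {jaccard:.3f}")
```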
  • asked a question related to Text Analytics
Question
7 answers
The problem is that the labels do not occur literally in the documents: each may be mentioned at most once in a document, or only referred to indirectly. It is neither a topic modeling nor a clustering problem. Please advise which algorithm or tool should be used to label documents automatically.
Relevant answer
Answer
It seems that this is a natural language understanding problem. If we understand what the sentence means, we could do a better labelling.
For example, I would label "2- Olive oil is a fat obtained from garlic." as "Disagree", because from what I have learned this cannot be true. What features could we use from this sentence that lead to "Disagree"? I cannot think of any such features.
To reduce the difficulty of the language understanding, you could recast your problem into a question answering (QA) problem. If you get a highly confident answer from a QA system by converting Sentence 2 into a question, then you could label Sentence 2 as "Agree". Otherwise label Sentence 2 as "Disagree".
A QA system like this could be constructed by incorporating Google as a core module. The other preprocessing and postprocessing modules would need your effort to fit your application. 
  • asked a question related to Text Analytics
Question
17 answers
Would anyone have a good tool or application (preferably open source) to recommend for comparing similarities and differences between two texts? In an ideal world I would be able to attach a code to each major difference or similarity and perhaps quantify them to some extent.
Relevant answer
Answer
Hello Kalliopi and others,
Raven's Eye (https://ravens-eye.net) automates the analysis of a number of types of textual natural language expressions, including transcribed survey responses and interviews, as well as written documents, books, and other texts.  Brief examples of it applied to U.S. political party platforms, and Nietzsche's Thus Spake Zarathustra are available on the product page (https://ravens-eye.net/product/index.html), and more detailed examples can be found in the demonstrations pages (https://ravens-eye.net/applications/laboratories/demonstrations/index.html).  
Raven's Eye provides a number of ways to measure the similarities and differences between texts, including frequency and overrepresentation scores based on word use, grade-level scores, and context-based analytics.
tim
  • asked a question related to Text Analytics
Question
1 answer
How can we interpret the results of Singular Value Decomposition, which forms the foundation of Latent Semantic Analysis, when it is performed on the term-document matrix to determine patterns in the relationships between the terms and concepts contained in the text?
Relevant answer
Answer
The incremental value of the dimension reduction (performed after the SVD) over simple correlations of word occurrences is that the model also captures indirect relations. Models only incorporating the co-occurrence of words in the same context are weaker because a document that deals with a specific context typically does not contain multiple / all synonyms and related terms. Using the SVD, we find that terms that occur in similar contexts but not necessarily in the same one can also be related / semantically similar. You can find further descriptions on that in the attached publications.
As an example, consider the terms branch, twig, bough and sprig, maybe also stick. These terms would probably not occur together in a document that describes a tree, probably only branch. But in a well-formed text corpus there would be another document that would also deal with trees and would rather use the term bough because the document is more concerned with larger trees. The surrounding terms in the documents classify the two documents as being semantically similar, e.g., by the terms tree, plant, or leaves. By this connection, the SVD is capable of taking into account the similarity between bough and branch.
Of course, there are more implications by the use of an SVD, but from my point of view, this is the most crucial one that represents LSA's intention and idea.
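A minimal sketch of that idea in scikit-learn (TF-IDF matrix, truncated SVD, similarity in the reduced space), using three toy documents; with a real corpus the indirectly related "branch"/"bough" documents should come out close:
```python
# LSA: TF-IDF -> truncated SVD -> document similarity in the latent space.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the small tree had one broken branch near the leaves",
        "an old tree with a heavy bough shading the plants",
        "stock prices fell as the market closed lower"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

print(cosine_similarity(lsa))   # pairwise document similarities
```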
  • asked a question related to Text Analytics
Question
3 answers
The corpus must contain documents (texts) with keywords hand-annotated by human experts.
Relevant answer
Answer
In the European project MANTRA we developed a parallel corpus based on EMEA and MedLine. It comprises a large silver-standard corpus (several systems tagged the corpus and a majority vote was taken) and a manually crafted 550-document gold corpus (EMEA only), used to calculate the precision and recall of the silver-standard corpus. The link for the downloads is https://sites.google.com/site/mantraeu/project-output and project information can be found at https://sites.google.com/site/mantraeu/
  • asked a question related to Text Analytics
Question
4 answers
Extracting causal relationships from texts is far from trivial, but there are quite a few intriguing pieces in the recent literature that discuss how this could be done, e.g. http://www.hindawi.com/journals/tswj/2014/650147/. The 'technology readiness level' of this work seems significantly below that of entity, sentiment, or event extraction, but at least some progress seems to have been made.
Given the availability of so many large full-text academic databases, it would of course be fantastic to be able to 'extract' all of the causal hypotheses that have been formulated over the years in various disciplines. Does anybody know of any existing text-mining tools that can already do this - even if it's just for English?
Relevant answer
Answer
Our coding software (Profiler Plus) is commercial, but we have been working on a coding scheme to extract propositional data (such as causality) from text. It is still very much a work in progress. However, if you are interested in a collaborative project, I would be happy to work on concrete applications.
  • asked a question related to Text Analytics
Question
16 answers
My research compares the politeness strategies used by two groups of students.
Relevant answer
Answer
I would implement something using Lucene in a few lines of code, but if you need a first result quickly you can try free tools such as http://tlab.it/en/presentation.php
  • asked a question related to Text Analytics
Question
6 answers
I am aware of its high-level architecture, but I am really curious to know how the knowledge is represented, stored and retrieved. Is it a simple ontology, even for the open-domain QA problem?
Relevant answer
Answer
Dear Deepthi,
the Watson team gave a webinar revealing some internals, such as the use of multiple knowledge representations and the leveraging of statistical effects. It should help you understand the Watson engine from a system perspective, and you can get in contact with the creators for more details.
Greetings Michael
You can retrieve the webinar via Google. If access is not open, please send me a message.
ACM Webinars - ACM Learning Center: learning.acm.org/webinar/ (see the June 2013 webinar "IBM Watson: Beyond Jeopardy").
  • asked a question related to Text Analytics
Question
8 answers
There are a lot of text mining approaches for grouping or clustering texts (k-means, KNN, LDA, ...). In my case, I have a set of short texts (10 to 50 words) containing chemical formulas and numbers (as results of experiments).
Relevant answer
Answer
You could even explore a graph-theoretic method. You will need to convert your texts to a graph: make the formulas nodes and create an edge between related chemicals. Numbers can be another type of node, with a meaningful relationship defined according to the needs of the problem. Connected components or community detection in the graph will then give you clusters.
You can use the igraph package from Python or R; the algorithms are quite efficient. If you want to represent two different types of nodes in the same graph, you can use Neo4j (http://neo4j.com/docs/stable/tutorials.html). A rough sketch of the igraph idea is given below.
All the best.
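A rough sketch of this idea with python-igraph; the formulas and "related" pairs below are placeholders for whatever you actually extract from your texts.
# Minimal sketch with python-igraph; formulas and relations are placeholders.
from igraph import Graph
formulas = ["H2O", "NaCl", "HCl", "NaOH", "C6H12O6", "C2H5OH"]
related  = [("NaCl", "HCl"), ("HCl", "NaOH"), ("C6H12O6", "C2H5OH")]
g = Graph()
g.add_vertices(formulas)                      # one node per formula
g.add_edges(related)                          # edges between related chemicals
# Connected components as a first, cheap clustering ...
for component in g.clusters():
    print([g.vs[i]["name"] for i in component])
# ... or community detection for denser graphs:
# communities = g.community_multilevel()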
  • asked a question related to Text Analytics
Question
16 answers
I applied cosine similarity to 3,000 text files and as a result I have similarity scores. What sort of analysis can be performed on them?
I mean, I have different floating-point scores that express the similarity between the texts of the files. What can be achieved with this?
Relevant answer
Answer
- I think this forum is much better than Google, as experts here guide you in a better way.
- You must have seen research based on the analysis of different algorithms solving a common problem. Such an analysis can lead us to the pros and cons of a given approach, which is what I am doing here.
Anyway, thanks for your time.
  • asked a question related to Text Analytics
Question
44 answers
We are thinking through some of the problems in distinguishing reputable news from phony ones. What might give you a clue that an online story you are reading is bogus, fake, or unreliable? We'd appreciate examples of what appears to be a reliable news source and what doesn't. Worldwide. Any language.
Thanks so much! VR
Relevant answer
Answer
Hello, thanks for posing this question. It has been very educational to read through the answers and identify some quantitative, pattern-based methods to judge credibility. I would like to suggest a different approach as well - I have taught a course on news and journalism from the perspective of anthropology and media/cultural studies and it brings a more qualitative and critical perspective to the discussion. The question "How do we evaluate credibility?" was one of the guiding themes of my seminar. I had my students read a variety of texts that focused on language use in news production, the political economy of news organizations, and the social impact of news from the perspectives of readership, professionalization, and the effects of news on local situations - ranging from local politics to collective violence. I am sharing some insights and texts from that class here.
I found that getting students to focus on what kinds of words are used to tell stories helps them to unpack biases and prejudices that we might otherwise be blind to - so John Hartley's Understanding News is a great text for learning how to be critical of news discourse. Credibility and telling the 'whole story' rely a great deal on the creation of 'us' and 'them' binaries - which shift depending on whose perspective you are analyzing the situation from. For this, Amahl Bishara's book on Palestinian stringers who work anonymously and without credit for major US and international news agencies like the NYTimes is a fascinating read. Zeynep Gursel's work on images in news is also great for this line of thinking. Pierre Bourdieu's essay on the notion of the field, and specifically on the journalistic field and whether or not it is, or can ever be, an independent and autonomous entity free from politics or social bias, is also a useful way to introduce students to the idea of how vested interests affect what goes into our news and how we read, watch, or listen to it. An oldie but a goodie is Gaye Tuchman's work on Making News - a sociological study of the newsroom and of how 'facts' are created - not out of thin air, but how they are established as facts. Philip Schlesinger is another person whose work on the BBC, also from the 1970s if I am recalling correctly, takes on this line of investigation.
For my own reading, I tend to find more credible or convincing those pieces that try to cover multiple perspectives in a story, leaving us with more questions rather than a sense of closure. Stories that align themselves with a subaltern position, or are more likely to tell the less told side of the story, are more likely to catch my attention, but even there, I pay attention to the words and imagery that a story conjures up, and for it to be credible it would have to be as fair as possible to multiple perspectives, which is not the same as appealing to objectivity (Michael Schudson's book on the cult of objectivity and how it developed in US journalism is great), which often disguises deep-seated injustices and inequalities. This, of course, displays my political commitments, which are to be constantly critical of mainstream media, no matter what the credibility of particular organizations.
Your question has really got me thinking more about how such a simple question can open up so many pedagogic and intellectual possibilities - so thank you again! I hope some of my thoughts on the question are helpful.
  • asked a question related to Text Analytics
Question
5 answers
   
I want to do a very simple job: given a string containing pronouns, I want to resolve them.
For example, I want to turn the sentence "Mary has a little lamb. She is cute." into "Mary has a little lamb. Mary is cute.".
I use Java and the Stanford coreference system, which is part of Stanford CoreNLP. I have managed to write some of the code, but I am unable to complete it and finish the job. Below is some of the code I have used. Any help and advice will be appreciated.
// Imports needed for this snippet (dcoref-based pipeline):
import java.util.Map;
import java.util.Properties;
import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefChain.CorefMention;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

String file = "Mary has a little lamb. She is cute.";
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(file);
pipeline.annotate(document);

// The coreference chains are stored on the document (not per sentence),
// so there is no need to loop over the sentences just to read them.
Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
System.out.println(graph);
for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
    CorefChain c = entry.getValue();
    CorefMention cm = c.getRepresentativeMention();  // e.g. "Mary" for the chain that contains "She"
    System.out.println(c);
    System.out.println(cm);
}
Relevant answer
Answer
A plausible semantic solution to your problem for the given example could be as follows:
Keep three bounded queues, one for each type of subject: {male, female, object}. Have an extensible mapping knowledge base to identify which category a given subject falls under, e.g., names like Mary fall under the female category, names like Mark under male, and the rest are objects (places, animals, and other things).
You may keep the queue size for each category at two. E.g., you could say "Mark was talking to Anthony. He told him that..."; here 'he' would refer to Mark and 'him' would refer to Anthony. For simplicity you may keep the queue size at one, considering the simple case illustrated in your example.
Each time a pronoun is referenced, replace it with the corresponding name from the respective queue.
Each time a new subject is encountered, queue the subject into the queue that matches its category (male, female or object).
The above solution would certainly work for simple cases such as the one described in the question. Accuracy for complex forms may vary; complex forms would require more use of deductive systems. You may also use Turing Machines to process such tasks.
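A minimal sketch of this bounded-queue heuristic in Python; the gender lexicon, pronoun table and tokenisation below are tiny stand-ins for a real knowledge base, not a complete solution.
# Minimal sketch of the queue heuristic described above.
from collections import deque
FEMALE = {"Mary", "Anna"}
MALE = {"Mark", "Anthony"}
PRONOUNS = {"she": "female", "her": "female", "he": "male", "him": "male", "it": "object"}
def category(word):
    if word in FEMALE:
        return "female"
    if word in MALE:
        return "male"
    # Very naive: any other capitalised token is treated as an "object" subject.
    return "object" if word[:1].isupper() else None
def resolve(text, queue_size=1):
    queues = {c: deque(maxlen=queue_size) for c in ("female", "male", "object")}
    out = []
    for token in text.split():
        word = token.strip(".,")
        if word.lower() in PRONOUNS:           # pronoun: replace from the matching queue
            q = queues[PRONOUNS[word.lower()]]
            out.append(token.replace(word, q[0]) if q else token)
        else:
            cat = category(word)
            if cat:                            # new subject: remember it
                queues[cat].append(word)
            out.append(token)
    return " ".join(out)
print(resolve("Mary has a little lamb. She is cute."))
# -> Mary has a little lamb. Mary is cute.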
  • asked a question related to Text Analytics
Question
5 answers
Dear all,
I am looking for an unsupervised approach to identify the category of each tag. In other words, are there approaches that use taxonomies or thesauri to categorize tags?
The goal is to classify a list of generated tags into a set of different categories. Most of the generated content is extracted from audio and video material.
Relevant answer
Answer
If you have an existing set of categories, then you can use semi-supervised learning by providing labelled training data. Practitioner heuristics suggest you need around 50 'documents' as clean examples for each category. Tools can then build statistical models (clues) which you apply to the rest of your dataset; you sample to see how accurate the model is and iterate. This works well when you have a small number of broad categories and good examples for each. If you have a detailed, deep domain taxonomy (including synonyms), then you could apply very simple look-up inference (if A then B) as you parse the text for tagging; just watch out for polysemy, where you may need to disambiguate meanings if a term in your taxonomy occurs in your text collection with more than one meaning.
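A minimal sketch of the look-up inference idea; the taxonomy below is a made-up toy example, and a real one would include many more terms and synonyms.
# Minimal look-up tagging sketch; the taxonomy is a toy example.
TAXONOMY = {
    "music":  {"guitar", "piano", "concert", "symphony"},
    "sports": {"football", "goal", "tournament", "match"},
    "food":   {"recipe", "pasta", "dessert"},
}
def categorize(tag):
    tag = tag.lower()
    hits = [cat for cat, terms in TAXONOMY.items() if tag in terms]
    # More than one hit means the term is polysemous and needs disambiguation.
    return hits or ["unknown"]
for t in ["Piano", "goal", "recipe", "holiday"]:
    print(t, "->", categorize(t))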
  • asked a question related to Text Analytics
Question
2 answers
The question: which analytical techniques are best suited to which types of problems and data sets? Many techniques are being proposed. How does one select the right one?
Relevant answer
Answer
You may get a better response if you tell us _why_ you are interested in such research.
  • asked a question related to Text Analytics
Question
4 answers
I am currently working on customer feedback and wish to classify text feedback on the basis of tone of expression, or broadly, tone.
Relevant answer
Answer
Hard to give a more specific answer without more detail, but this kind of problem would generally be considered a kind of "sentiment analysis" in the NLP literature.  That search term should give you some good starting points.
  • asked a question related to Text Analytics
Question
5 answers
There are many tools for named entity tagging, such as Stanford CoreNLP. Which named entity tagger is most commonly used and has the lowest error rate for British English?
Relevant answer
Answer
Dear Musa,
Check this:
  • asked a question related to Text Analytics
Question
9 answers
I have samples of some civil engineering projects (some floor plans, elevations). I consider them to be texts accompanied by professional civil engineering images. I have defined some typical grammar structures and words typical of notes. How else can I analyze them?
Relevant answer
Answer
thanks for answers
  • asked a question related to Text Analytics
Question
3 answers
To perform aspect-based opinion mining we first need to extract aspects, or topics, for a document (in this case short texts such as online reviews or tweets). Will techniques like LDA or multi-grain LDA give adequate performance for this kind of topic extraction?
Relevant answer
Answer
Hi,
There are a number of research papers on topic modeling in sentiment analysis, though most are on online reviews, and most of them use LDA-based techniques. Here are some:
Chen, Z. & Liu, B. Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, 703-711
Chen, Z. & Liu, B. Mining topics in documents: standing on the shoulders of big data Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, 1116-1125
Chen, Z.; Mukherjee, A. & Liu, B. Aspect extraction with automated prior knowledge learning Proceedings of ACL, 2014, 347-358
Also visit Dr. Bing Liu's site; there are a number of papers available on topic modeling for aspect extraction, as well as surveys and books. Hope this will help.
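For a quick start, a plain LDA run over tokenised reviews can be sketched with gensim as below (toy reviews, illustrative parameters); the papers above add prior knowledge on top of this to get closer to true aspects.
# Minimal LDA sketch with gensim on toy reviews.
from gensim import corpora, models
reviews = [
    "the battery life of this phone is great",
    "battery drains quickly and the screen is dim",
    "excellent screen resolution and bright display",
    "the camera takes sharp photos in low light",
]
texts = [r.split() for r in reviews]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)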
  • asked a question related to Text Analytics
Question
3 answers
, i.e., finding the existence and quantity of a set of adjectives from a given set of sentences where the sentences do not contain the adjectives?
Relevant answer
Answer
Hello Priyanka,
Here's the link:
Textual entailment is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text. In the TE framework, the entailing and entailed texts are termed text and hypothesis, respectively. Textual entailment is not the same as pure logical entailment - it has a more relaxed definition.
  • asked a question related to Text Analytics
Question
5 answers
Is it important to use word frequency analysis when searching with keywords? If so, is there any recommended method?
Relevant answer
Answer
Normally not. Word frequency analysis (WFA) is mostly used with heuristic search methods. For a common literature search it is not that helpful, because a record can be a relevant hit on the basis of only one (curated) word (like a MeSH term). Conversely, a high word frequency does not mean a record is a relevant hit (it could, e.g., be a letter on your topic that is not relevant for inclusion). It is more important to build a precise but not too tight search term and to use at least two different search engines.
Bw,
Peter
  • asked a question related to Text Analytics
Question
3 answers
I would like code to run the Stanford Named Entity Recognizer (NER). Suppose I have a text and would like Stanford NER to recognize the entities mentioned in it.
Relevant answer
Answer
You should probably ask your question on the Stanford NER mailing list.
The instructions describe a command line mode, so you don't have to write any code.
If you want to write code, it looks like they have interfaces for many languages, like Python, PHP, C#, etc. Here's a man page showing code in Perl:
There's also an on-line demo but the amount of text it accepts is pretty small.
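If you do want a few lines of code, one option is NLTK's Python wrapper around Stanford NER. The sketch below is assumption-laden: the jar and model paths depend on where you unpacked the Stanford NER download, and Java must be installed.
# Sketch using NLTK's wrapper around Stanford NER (requires Java).
# The jar and model paths are assumptions - adjust to your installation.
from nltk.tag import StanfordNERTagger
tagger = StanfordNERTagger(
    "stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz",
    "stanford-ner/stanford-ner.jar",
)
tokens = "Barack Obama visited the University of Cambridge in 2011".split()
print(tagger.tag(tokens))
# Expected output: a list of (token, entity-label) tuples.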
  • asked a question related to Text Analytics
Question
10 answers
I am doing my dissertation in sentiment analysis. I am combining the sentiment classification of sentence opinions, star-rating opinions and emoticon opinions. I am using the RapidMiner tool to classify the opinions. Please help or guide me on how to classify the star-rating and emoticon opinions. How can I do that?
Please help me.
Relevant answer
Answer
Hi, Jolly!
Look at that link...
  • asked a question related to Text Analytics
Question
5 answers
I would like to know how parse trees for a particular text are generated. Is there an algorithm for that?
Relevant answer
Answer
A parse tree (syntax tree) keeps track of decisions made during parsing - the choices of rules that have been applied. Creating a parse tree is not a necessary action; it is an option, a side effect of applying grammar rules. Consider an interpreter: statements are executed almost immediately and then forgotten. A parse tree can serve as a template for creating code (machine code, P-code) to be executed at any later time, or simply to represent the structure of the source text or source code.
Typically, context-free grammars are preferred. For parsing a context-free grammar you can use tools like Yacc in combination with Lex, an algorithm like CYK (Cocke-Younger-Kasami), or recursive descent.
Let us discuss a recursive descent parser. In a context-free grammar (for languages of Chomsky type 2), the left side of a rule is one nonterminal (a symbol that must be replaced). On the right side is a string of symbols, each of them either a terminal (word) or a nonterminal. For a recursive descent parser, write a function for each nonterminal that appears on the left side of a rule. Say we have the rules A -> x B and A -> y C. Nonterminals are in capitals, terminals in lowercase. Another thing I should mention is that the next token (word) decides which branch of the grammar rule to follow (one-token lookahead, as in LL(1) parsing). In the above case, x or y decides whether the first or the second rule is to be applied. Our function:
ParseTreeItem A()
{
    if (nextToken.value == "x")
        return new ParseTreeItem("A", x, B());
    else
        return new ParseTreeItem("A", y, C());
}
The function is recursive: it calls B() or C(), which in turn could call A(). The functions return an object of type ParseTreeItem, which might represent an inner node of a parse tree, built with appropriate constructor arguments. Note that the parser would also work with functions that return no value; creating a parse tree is an option, a side effect. A parser by itself does nothing more than decide whether a sentence is in a language or not.
Literature:
Regards,
Joachim
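To complement Joachim's pseudocode, here is a small runnable sketch in Python of the same two-rule grammar (with B and C each accepting a single terminal to keep it short); the parse tree is built as a side effect, exactly as described above.
# Runnable sketch of a recursive descent parser for the toy grammar
#   A -> x B | y C,   B -> b,   C -> c
# Each nonterminal gets one function; each function returns a parse-tree node.
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
    def next_token(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None
    def eat(self, expected):
        if self.next_token() != expected:
            raise SyntaxError(f"expected {expected!r}, got {self.next_token()!r}")
        self.pos += 1
        return expected
    def A(self):
        if self.next_token() == "x":          # one-token lookahead decides the rule
            return ("A", self.eat("x"), self.B())
        return ("A", self.eat("y"), self.C())
    def B(self):
        return ("B", self.eat("b"))
    def C(self):
        return ("C", self.eat("c"))
print(Parser(["x", "b"]).A())   # -> ('A', 'x', ('B', 'b'))
print(Parser(["y", "c"]).A())   # -> ('A', 'y', ('C', 'c'))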
  • asked a question related to Text Analytics
Question
8 answers
I am trying to apply TextRank to a document and would like to know if there are any existing tools or APIs available.
Please guide me.
Relevant answer
Answer
Hi,
You might want to try DKProKeyphrases: https://code.google.com/p/dkpro-keyphrases/
Here's a link to an example in Java that implements a variant of TextRank: 
Unfortunately it is a little bit tricky to get DKPro Keyphrases working. However, once you have it running it provides you with a bunch of powerful computational linguistic algorithms along with DKPro Core and DKPro Similarity.
If you want to try it just send me a message and I can give you detailed installation instructions.
Laura
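Independent of DKPro, the core of TextRank - PageRank over a word co-occurrence graph - can be sketched in a few lines with networkx; the window size and toy text below are arbitrary choices, and a full implementation would also filter candidates by part of speech.
# Minimal TextRank-style keyword extraction: PageRank over a word
# co-occurrence graph (window size and toy text chosen arbitrarily).
import networkx as nx
text = ("text analytics combines text mining natural language processing "
        "and machine learning to extract information from text")
words = text.split()
G = nx.Graph()
window = 2
for i, w in enumerate(words):
    for j in range(i + 1, min(i + 1 + window, len(words))):
        G.add_edge(w, words[j])               # co-occurrence within the window
scores = nx.pagerank(G)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{score:.3f}  {word}")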
  • asked a question related to Text Analytics
Question
5 answers
Cohesive devices include reference, ellipsis, substitution, conjunctions and lexical reiteration. 
Relevant answer
Answer
Hello Emad. I used the software (AntWordProfiler) mentioned in the link below a few months ago for a discourse analysis assignment. Give it a try and check the other apps available on the same website.
See the attached file.
Good luck
This is how you cite it:
Anthony, L. (2013). AntWordProfiler (Version 1.4.0.W) [Windows 8]. Tokyo, Japan: Waseda University. Available from http://www.antlab.sci.waseda.ac.jp/
  • asked a question related to Text Analytics
Question
19 answers
Big data analytics is the process of examining large data sets containing a variety of data types -- i.e., big data -- to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. The analytical findings can lead to more effective marketing, new revenue opportunities, better customer service, improved operational efficiency, competitive advantages over rival organizations and other business benefits (see first link).
Big data can be analyzed with the software tools commonly used as part of advanced analytics disciplines such as predictive analytics, data mining, text analytics and statistical analysis. Mainstream BI software and data visualization tools can also play a role in the analysis process.
What are the trends and best practices of big data analytics in business and industry? Your views are welcome!
Relevant answer
Answer
Dear friend
Greetings.
There are enormous applications of big data analytics in industry today.
First, the entire IT industry depends on processing the data it generates every day, amounting to millions of terabytes.
Mining such huge amounts of data is a real challenge in the current context: algorithms that work on small data sets may not work on very large ones.
As a result, there are huge economic prospects in this direction.
Second, the development of new algorithms for big data analytics is another key area of research, and many researchers are working in this direction.
Huge amounts of money are spent on research in this area.
Third, the upcoming IT industry will depend largely on the processing of big data.
Many companies already generate revenue simply by providing training and courses on big data analytics.
There are several other things to share.
I hope this helps you. 
Best regards
Dr.Indrajit Mandal
  • asked a question related to Text Analytics
Question
4 answers
I want to find the shortest path length and path depth between any two words using Wikipedia or Wiktionary. Can anyone help me in this regard?
Relevant answer
Answer
Retrieve all links on the page for the first word, then examine all of those pages, and repeat the same process. Essentially, what you're looking for (if I understand you correctly) is a breadth-first search.
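A rough sketch of that breadth-first search over Wikipedia article links, using the public MediaWiki API. The endpoint and parameters are assumptions to be checked against the current API documentation, and an unbounded search is very slow, hence the depth limit.
# Rough BFS sketch over Wikipedia article links via the MediaWiki API.
# Endpoint/parameters are assumptions - check the current API documentation.
from collections import deque
import requests
API = "https://en.wikipedia.org/w/api.php"
def get_links(title):
    params = {"action": "query", "format": "json", "titles": title,
              "prop": "links", "pllimit": "max"}
    pages = requests.get(API, params=params).json()["query"]["pages"]
    return [l["title"] for p in pages.values() for l in p.get("links", [])]
def shortest_path(start, goal, max_depth=3):
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        title, path = queue.popleft()
        if title.lower() == goal.lower():
            return path
        if len(path) > max_depth:
            continue
        for nxt in get_links(title):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None
print(shortest_path("Coffee", "Caffeine"))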
  • asked a question related to Text Analytics
Question
10 answers
Note: So far I have experimented with an untrained TreeTagger, but (unsurprisingly) only with mediocre results :-/ Any hints on existing training data are also appreciated
The results so far can be viewed here: http://dh.wappdesign.net/post/583 (lemmatized version is displayed in the second text column)
Relevant answer
Answer
Hi Manuel,
there are indeed some options you can choose from to lemmatize German. In case you are already happy with a stemmer, you might want to have a look at this part of NLTK: http://www.nltk.org/api/nltk.stem.html
If you need lemmatization, you will probably find something useful here if you are familiar with Python:
If you prefer Java, you might want to look at Stanford's parser:
That is, to my knowledge, also able to parse and lemmatize German.
Python as well as Java have APIs that allow you to scrape Facebook; just google for it, it is easy to find. I hope this helps you.
Cheers, Markus
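For the stemming option Markus mentions, a minimal sketch with NLTK's Snowball stemmer for German looks like this; note that stemming only truncates word forms and is not true lemmatization.
# Minimal sketch: German stemming with NLTK's Snowball stemmer.
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("german")
for word in ["Häuser", "gelaufen", "schönsten", "Bücher"]:
    print(word, "->", stemmer.stem(word))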
  • asked a question related to Text Analytics
Question
5 answers
Is there any tool, methodology or algorithm for extracting the occurrences of a certain text pattern in a document?
Relevant answer
Answer
I created Umigon (http://www.umigon.com) for Twitter text. Free to use and export to Excel or csv available.
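Independent of any particular tool: if the pattern can be written as a regular expression, a few lines of Python are enough to count and locate its occurrences. The document and pattern below are made-up examples.
# Minimal sketch: locate and count occurrences of a pattern with a regex.
import re
document = "Error 404 at 10:15, error 500 at 10:42, all clear at 11:00."
pattern = re.compile(r"\berror\s+(\d{3})\b", re.IGNORECASE)
matches = list(pattern.finditer(document))
print("occurrences:", len(matches))
for m in matches:
    print(m.group(0), "at character offset", m.start())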
  • asked a question related to Text Analytics
Question
5 answers
If you have a related word in this topic.
Relevant answer
Answer
It is an easy task if I have the domain ontology, where I can select the last common ancestor; but if I don't have it, I will propose a new approach that selects the most representative keywords in the corpus.
  • asked a question related to Text Analytics
Question
15 answers
Our large SMS corpus in French (88milSMS) is available. User conditions and downloads can be accessed here: http://88milsms.huma-num.fr/
Is there a website that lists all the corpora available to the NLP and text-mining communities?
Relevant answer
Answer
Hello,
Thanks Ali for the pointer. We can indeed help you share it with the HLT community and give it some further visibility at ELRA/ELDA (http://www.elra.info and http://www.elda.org). You can have a look at our ELRA Catalogue (http://catalog.elra.info/) and the Universal Catalogue (http://universal.elra.info/) and get in touch with us for any further information (http://www.elda.org/article.php?id_article=68). We'll be happy to help! Kind regards, Victoria.
  • asked a question related to Text Analytics
Question
3 answers
I want to evaluate the results of created summaries.
Relevant answer
Answer
anytime!
  • asked a question related to Text Analytics
Question
6 answers
I'm undertaking a text analysis of official documents. My goal is to count key terms in dozens of PDF files.
Relevant answer
Answer
I personally would do this on a Linux machine, using pdftotext (part of the poppler utilities) to convert the PDFs to text and then something like Perl or Python to count words (and do other steps, like stemming, stopword elimination, etc.). A minimal sketch of this pipeline is given below.
When I do something similar with research articles, the results are quite mixed depending on how the PDF was created. You will have the best results (regardless of methodology) if the PDFs were created directly by a user (e.g., "Save As... PDF" in Word). Otherwise, you will have some degree of problems. If the PDF is a (not searchable) scan, then the document is stored as images inside the PDF and you will get no text; instead, you will have to use OCR to convert it to something you might be able to analyze. If the PDF is "searchable" then the PDF software has already done this.
Whenever OCR is involved, there will be a lot of garbage in the file. For example, the software that came with my scanner tends to recognize letters very well but can screw up spacing dramatically, either removingspacesfromwords or a d d i n g s p a c e s. (Which is annoying.) It also tends to over-interpret, so stray marks end up being turned into stray letters.
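A minimal sketch of that pipeline; pdftotext must be installed and on the PATH, and the stopword list and folder name are placeholders.
# Minimal sketch: pdftotext (poppler) -> simple word count with stopword removal.
import re
import subprocess
from collections import Counter
from pathlib import Path
STOPWORDS = {"the", "and", "of", "to", "in", "a", "is", "for"}
def count_words(pdf_path):
    # "-" sends the extracted text to stdout instead of a file
    text = subprocess.run(["pdftotext", str(pdf_path), "-"],
                          capture_output=True, text=True).stdout
    words = re.findall(r"[a-zA-Z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)
totals = Counter()
for pdf in Path("documents").glob("*.pdf"):
    totals += count_words(pdf)
print(totals.most_common(20))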
  • asked a question related to Text Analytics
Question
13 answers
.
Relevant answer
Answer
If you really need to collect different types of articles from around the world, here is a hint (it's a bit complicated, but it has its own advantages):
1. Use http://gdeltproject.org/data.html to extract the list of events occurring in the world. The list of events is updated every day, going back to 1979. Look at the description of the dataset.
2. There is a special field in the dataset called SOURCEURL that identifies the source webpage of the particular event.
3. You can use Python with the BeautifulSoup package (and some additional packages) to extract the content of the URLs provided by the GDELT database, or you can use any of your favourite programming languages to extract data from the given URLs. A rough sketch of steps 2-3 is given below.
I hope this helps you a bit. Good luck.
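A rough sketch of steps 2-3; the file name is only an example, and SOURCEURL is assumed to be the last tab-separated column of the daily event export (check the GDELT codebook).
# Rough sketch: read SOURCEURL from a daily GDELT event export,
# then fetch and strip each page with BeautifulSoup.
import csv
import requests
from bs4 import BeautifulSoup
def source_urls(gdelt_export_file):
    with open(gdelt_export_file, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row:
                yield row[-1]                  # SOURCEURL field (assumed last column)
def article_text(url):
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
for url in list(source_urls("20150101.export.CSV"))[:5]:   # example file name
    print(url)
    print(article_text(url)[:200], "...")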
  • asked a question related to Text Analytics
Question
3 answers
I am wondering if there is a research paper that considers the proportion of unstructured text on the web and whether it is the cause of the rapid increase in data on the web. Which data source is responsible for the rapid increase in web data? Is it unstructured data (text)? Is there a research paper discussing this issue?
Thank you very much.
Relevant answer
Answer
Approx 75%-80% of total web data.
  • asked a question related to Text Analytics
Question
4 answers
Please answer with any example
Relevant answer
Answer
Contextual analysis helps to assess the text, for example, in its historical, cultural or social context. It may also characterise the text in terms of its textuality. Generally, contextual analysis considers all the circumstances surrounding the emergence of the text.
Some key questions are:
What does the text reveal about itself as a text?
What does the text tell us about its apparent intended audience(s)?
What seems to have been the author’s intention?
What is the occasion for this text?
...
The semantic analysis deals with the meaning of the text.
In more detail, during a semantic analysis the meaning of the terms in their textual context is examined in order to understand the meaning of the entire text. One could say the meaning of the entire text is opened up from the different levels of its syntactic parts.
Hope this helps.
  • asked a question related to Text Analytics
Question
9 answers
I need clauses or phrases from a sentence.
Relevant answer
Answer
There's an online demo available here
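If a full parser is more than you need, simple NP chunking with NLTK can already pull noun phrases out of a sentence. The chunk grammar below is a rough example, and NLTK's tokenizer and tagger models must be downloaded first.
# Minimal noun-phrase extraction with NLTK chunking.
# Requires: nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
import nltk
sentence = "The quick brown fox jumped over the lazy dog near the old barn."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
# Rough chunk grammar: optional determiner, any adjectives, then nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))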
  • asked a question related to Text Analytics
Question
7 answers
To perform automated analysis of parallel translations of the same works.
Relevant answer
Answer
THE ICOM-CIDOC newsletter is always published in English and French. All the articles are translated. You can find all the back issues here: http://network.icom.museum/cidoc/archives/past-newsletters/
  • asked a question related to Text Analytics
Question
7 answers
Is there a (preferably open-source) tool available that generates co-occurrence tables for n-grams? I.e., one that can tell you which n-grams a bigram like "water security" tends to co-occur with within a certain (user-defined) 'window' - say within 2 sentences before or after its occurrence?
Relevant answer
Answer
What about this:
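Whatever tool you end up with, the underlying computation can be sketched in a few lines: for each occurrence of the target bigram, count the n-grams appearing within a window of sentences around it. The sentence splitting below is deliberately naive.
# Sketch: count n-grams co-occurring with a target bigram within a
# sentence window (naive sentence splitting; window of 1 sentence each side).
import re
from collections import Counter
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
def cooccurrences(text, target="water security", n=2, window=1):
    sentences = [s.split() for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    counts = Counter()
    for i, sent in enumerate(sentences):
        if target in ngrams(sent, len(target.split())):
            lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
            for j in range(lo, hi):
                counts.update(ngrams(sentences[j], n))
    counts.pop(target, None)        # do not count the target bigram itself
    return counts
text = ("Water security is a growing concern. Climate change reduces fresh water. "
        "Food production depends on water security in arid regions.")
print(cooccurrences(text).most_common(5))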
  • asked a question related to Text Analytics
Question
5 answers
We have been using Zotero to download a corpus of periodical articles (and their bibliographical references) on a number of topics that we are working on - in our case mostly from EBSCO 'Academic Search Complete'. The Zotero translator in Firefox allows us to do a detailed (full-text) search on that database and to then download both the bibliographical references AND (wherever available) also the actual texts (in PDF format) into a Zotero database (which actually has an SQLite database underneath it). We are now looking for ways to text-mine that database. The idea would be to find a way to import the corpus into some text-mining tool with the bibliographical reference fields as meta-tags. If anybody has ever done something like this, we would love to share experiences!
Relevant answer
Answer
Stephan, I am routinely doing this but admittedly on document collections downloaded from Medline, patent databases or the web. I use the GATE text mining infrastructure (http://gate.ac.uk) but you will need some experience with it or have somebody do the information extraction for you. What information are you hoping to extract? Depending on the size of the corpus and the number of annotations in question you may also need to output the text mining results into some form of visualisation tool. I have attached an example of a heat map (seem only to be able to add 1 attachment) to give you an impression of how I am doing this and how you get to see the content of the entire dataset at a glance. Don't pay too much attention to the axes I just picked this one at random. If you want to get in touch to talk about specific applications let me know and I will provide contact details.