Text Analytics - Science topic
Explore the latest questions and answers in Text Analytics, and find Text Analytics experts.
Questions related to Text Analytics
I need to write a literature review on the topic of text analytics in digital marketing
How should ChatGPT and other similar intelligent chatbots be improved so that they do not generate plagiarism of other publications that their authors have previously posted online?
This issue is particularly important because the data entered into ChatGPT, including the information contained in texts submitted for automated rewriting, remains in the database that this chatbot draws on when generating answers to questions asked by subsequent Internet users. The problem has become serious, as there have already been situations where sensitive data on specific individuals, institutions and business entities has been leaked in this way. At the same time, many institutions and companies use ChatGPT in the preparation of reports and the editing of certain documents. Pupils and students also use ChatGPT and other similar intelligent chatbots to generate texts that serve as credit papers and/or from which they then compose their theses. On the other hand, some existing anti-plagiarism applications have been extended with functions that detect the use of ChatGPT in the writing of students' credit papers and theses.
In addition, the problem is also normative in nature: the legal norms of copyright law need to be adapted to the dynamic technological advances taking place in the development and application of generative artificial intelligence, so that the provisions of this law are not violated by users of ChatGPT or other similar intelligent chatbots. One measure that could significantly reduce the scale of this problem would be a mandatory requirement to mark all works, including texts, graphics, photos, videos, etc., that have been created with the help of the said intelligent chatbots as having been so created.
Moreover, the chatbots themselves need to be improved by their creators, the technology companies developing these tools, in order to eliminate the possibility of ChatGPT "publishing" confidential, sensitive information from institutions and companies in response to the questions, commands and text-generation tasks of subsequent Internet users. The said intelligent chatbots should also be improved so that when, in the course of automated text generation, they draw inspiration from other source texts, "quote" whole sentences or substantial fragments of them, or reproduce the substantive content of other publications, they show the sources in full, i.e. provide a full bibliographic description of all the source publications used. As things stand, the user of these intelligent chatbots does not know to what extent the text created with their help plagiarises other texts previously entered into them or publications available on the Internet, including documents of companies and institutions, theses, scientific publications, industry articles, journalistic articles, etc.
I described the key issues of opportunities and threats to the development of artificial intelligence technology in my article below:
OPPORTUNITIES AND THREATS TO THE DEVELOPMENT OF ARTIFICIAL INTELLIGENCE APPLICATIONS AND THE NEED FOR NORMATIVE REGULATION OF THIS DEVELOPMENT
In view of the above, I address the following question to the esteemed community of scientists and researchers:
How should ChatGPT and other similar intelligent chatbots be improved so that they do not generate plagiarism of other publications that their authors have previously posted on the Internet?
What is your opinion on this issue?
Please answer,
I invite everyone to join the discussion,
Thank you very much,
Best wishes,
Dariusz Prokopowicz
The above text is entirely my own work written by me on the basis of my research.
In writing this text I did not use other sources or automatic text generation systems.
Copyright by Dariusz Prokopowicz

To what extent does the ChatGPT technology independently learn to improve the answers given to the questions asked?
To what extent does ChatGPT consistently and successively improve its answers, i.e. the texts generated in response to the questions asked, over time and as it receives further questions, using machine learning and/or deep learning?
If ChatGPT, with the passage of time and the receipt of successive questions, were to continuously and successively improve its answers, i.e. the texts generated in response to the questions asked, including repeated questions, using machine learning and/or deep learning, then the answers obtained should become better and better in terms of content over time, and the scale of errors, non-existent "facts" and new but factually incorrect "information" created by ChatGPT in the automatically generated texts should gradually decrease. But does the current generation, ChatGPT 4.0, already apply sufficiently advanced automatic learning to create ever better texts in which the number of errors decreases? This is a key question that will largely determine the possibilities for practical applications of this artificial intelligence technology in various fields, professions, industries and sectors of the economy.
On the other hand, the possibilities of this learning process to produce better and better answers will become increasingly limited over time if the 2021 knowledge base used by ChatGPT is not updated and enriched with new data, information, publications, etc. It is likely that such processes of updating and expanding the source database will be carried out in the future. Whether such updates and extensions to the source knowledge base are carried out will be determined by ongoing technological advances and the increasing pressure for the business use of such technologies.
In view of the above, I address the following question to the esteemed community of scientists and researchers:
To what extent does ChatGPT, with the passage of time and the receipt of further questions using machine learning and/or deep learning technology, continuously, successively improve its answers, i.e. the texts generated as a response to the questions asked?
To what extent does the ChatGPT technology itself learn to improve the answers given to the questions asked?
What is your opinion on this subject?
Please respond,
I have described the key issues of opportunities and threats to the development of artificial intelligence technology in my article below:
OPPORTUNITIES AND THREATS TO THE DEVELOPMENT OF ARTIFICIAL INTELLIGENCE APPLICATIONS AND THE NEED FOR NORMATIVE REGULATION OF THIS DEVELOPMENT
Do you see more threats or opportunities associated with the development of artificial intelligence technology? What is your opinion on this issue?
I invite you to read the issues described in the article cited above and to scientific cooperation on these topics.
I invite you all to discuss,
Thank you very much,
Warm regards,
Dariusz Prokopowicz
The above text is entirely my own work written by me based on my research.
In writing this text, I did not use other sources or automatic text generation systems.
Copyright by Dariusz Prokopowicz

I've looked at Transana as well, but would prefer not to have to transcribe all parts of all of the videos.
How can artificial intelligence such as ChatGPT and Big Data Analytics be used to analyse the level of innovation of the new economic projects that startups plan to develop, implementing innovative business solutions, technological innovations, environmental innovations, energy innovations and other types of innovations?
The economic development of a country is determined by a number of factors, which include the level of innovativeness of economic processes, the creation of new technological solutions in research and development centres, research institutes and the laboratories of universities and business entities, and their implementation into the economic processes of companies and enterprises. In the modern economy, the level of innovativeness of the economy is also shaped by the effectiveness of innovation policy, which influences the formation of innovative startups and their effective development.
The economic activity of innovative startups carries a high investment risk, and for the institutions financing their development this generates a high credit risk. As a result, many banks do not finance business ventures led by innovative startups. As part of systemic programmes financing the development of startups from national public funds or international innovation support funds, financial grants are organised, which can be provided as non-refundable financial assistance if a startup successfully develops certain business ventures according to the original plan entered in the application for external funding. Non-refundable grant programmes can thus activate the development of innovative business ventures carried out in specific areas, sectors and industries of the economy, including, for example, innovative green business ventures that pursue sustainable development goals and are part of green economy transformation trends.
Institutions distributing non-refundable financial grants should constantly improve their systems for analysing the level of innovativeness of the business ventures described as innovative in startups' applications for funding. As part of improving systems for verifying the level of innovativeness of business ventures and the fulfilment of specific goals, e.g. sustainable development goals, green economy transformation goals, etc., new Industry 4.0 technologies implemented in Business Intelligence analytical platforms can be used. Among the Industry 4.0 technologies that can be used to improve such verification systems are machine learning, deep learning, artificial intelligence (including, e.g., ChatGPT), Business Intelligence analytical platforms with implemented Big Data Analytics, cloud computing, multi-criteria simulation models, etc.
In view of the above, given appropriate IT equipment, including computers equipped with new-generation processors characterised by high computing power, it is possible to use artificial intelligence, e.g. ChatGPT, Big Data Analytics and other Industry 4.0 technologies to analyse the level of innovativeness of the new economic projects that startups plan to develop, implementing innovative business, technological, ecological, energy and other types of innovations.
In view of the above, I address the following question to the esteemed community of scientists and researchers:
How can artificial intelligence such as ChatGPT and Big Data Analytics be used to analyse the level of innovation of the new economic projects that startups plan to develop, implementing innovative business solutions, technological innovations, ecological innovations, energy innovations and other types of innovations?
What do you think?
What is your opinion on this subject?
Please respond,
I invite you all to discuss,
Thank you very much,
Warm regards,
Dariusz Prokopowicz

Hi,
Most researchers know the R Views website, which is:
I am wondering whether this website lists all the R packages available to researchers.
Thanks & Best wishes
Osman
Which free software for text analytics tasks (such as text mining and sentiment analysis) can be recommended for social science researchers and students who do not have much background in programming or data science?
I understand that R could be the obvious answer for many, given its capabilities, but I am specifically looking to shortlist 4 to 5 GUI/point-and-click options that can be recommended to early-career researchers and postgraduate students in the social sciences, especially Psychology.
I have experimented with KNIME and Orange, but won't certify them as 'friendly enough'. This could be because I did not spend enough time on them, though.
Dear researchers, I welcome everyone! 🙂
I'm currently preparing an article in which I plan to use distributed data processing tools, in particular Apache Spark, for text analytics tasks. One criterion for the quality of a distributed computing system is task execution time; that one is obvious.
Question: which other criteria can additionally serve to assess the quality of a distributed computing system?
In a new project I want to capture emotions in texts written by students during their studies.
I assume that the majority of these texts are factual and contain few emotions.
- Am I wrong, do student texts contain emotions from a semantic or psycholinguistic point of view?
- Is there any literature on semantic, psycholinguistic text analyses or sentiment analyses of student texts written during their studies?
I am looking for a software for searching a text in a set of files. Any recommendations?
It should be something similar to the Multi Text Finder.
The aim is to teach students to find important information in documents.
How accurate is prognostic research carried out on the basis of sentiment analysis of Internet users' comments archived in Big Data systems?
Currently, many research centres use sentiment analysis by examining the content of comments, posts and news items of thousands or millions of users of social media portals and other websites. The sentiment analysis is carried out on large collections of information collected from many deliberately selected websites and stored in Big Data database systems. This analysis is carried out periodically, at specific time intervals, to diagnose changes in the main trends of general social awareness, including opinions on specific topics in society. It may also concern the diagnosis of dominant political sympathies, specific political views and opinions on selected political topics, and it is used to examine public support at a given time for specific politicians or candidates in presidential or parliamentary elections. If this type of sentiment analysis is carried out directly before presidential or parliamentary elections, it can be treated as an additional research instrument of a prognostic nature. Results obtained from this type of prognostic analysis have in some cases been characterised by a high level of accuracy.
In view of the above, the current question is: how accurate is prognostic research carried out on the basis of sentiment analysis of Internet users' comments archived in Big Data systems?
Please answer with your comments. I invite you to the discussion.

Deeper Learning is a set of educational outcomes of students that include the acquisition of robust core academic content, higher order thinking skills and learning dispositions.
Education must enable the mastery of skills such as analytical thinking, complex problem solving and teamwork.
- What are the characteristics of deeper learning that can be captured in student texts?
- Is there a collection of words, terms, phrases, sentences etc. that indicate or characterise deeper learning?
We are working on using machine learning to capture the level of deeper learning from students' responses.
With this, we want to investigate the learning process of specially developed tasks.
We are looking for existing instruments that make this possible and are happy and grateful for hints and help.
I am looking for an Arabic dataset, especially one of chat conversations.
Thanks
I want to know about current research trends in machine learning and natural language processing (NLP) for code-mixed text, in detail and as soon as possible. This is for a (theoretical) computer science research project. Thanks in advance.
I am looking for a pre-trained Word2Vec model for English. I have used a model trained on the Google News corpus; now I need a model trained on a Wikipedia corpus. I tried one downloaded from [https://github.com/idio/wiki2vec/], but it didn't work: I am using Python 3.4 and the model was trained under Python 2.7. Would anyone like to share such a model that works with Python 3.4?
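In case it helps, a minimal sketch of loading pre-trained vectors with gensim under Python 3 (the file name below is a placeholder): loading from the word2vec text/binary vector format is usually more portable across Python versions than unpickling a model saved under Python 2.7.

# Minimal sketch, assuming gensim is installed and the vectors file name is hypothetical.
from gensim.models import KeyedVectors

# binary=False for the plain-text .txt/.vec format, binary=True for a .bin file
vectors = KeyedVectors.load_word2vec_format("enwiki_vectors.txt", binary=False)

print(vectors.most_similar("king", topn=5))   # sanity check
print(vectors["queen"].shape)                 # dimensionality of the embedding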
We used SPSS to conduct a mixed model linear analysis of our data. How do we report our findings in APA format? If you can direct us to a source that explains how to format our results, we would greatly appreciate it. Thank you.
I have trained word embeddings on a "clean" corpus with fastText, and I want to compare their quality against the word embeddings from the pre-trained multilingual BERT, which I understand to have been trained on a "very noisy" corpus (Wikipedia).
Any suggestions or ideas on how to go about evaluating/comparing the performance would be appreciated.
hLDA has C code available; however, I am not able to find an R or Python implementation of the same.
Hi everyone,
I am doing sentiment analysis using convolutional and recurrent neural networks. I want to train my neural network using pre-trained weights, but weight files such as GoogleNews-vectors and freebase-vectors are too large for my system's computing power. So, please let me know if anyone is aware of other, smaller open-source pre-trained word2vec weight files.
Your response will be highly appreciated.
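One hedged suggestion: gensim's downloader API offers several much smaller pre-trained embeddings; the model names below are those registered in the gensim-data catalogue.

# Sketch assuming gensim with its downloader API; these models are far smaller than GoogleNews-vectors.
import gensim.downloader as api

small_vectors = api.load("glove-wiki-gigaword-50")   # 50-dimensional GloVe vectors, tens of MB
# small_vectors = api.load("glove-twitter-25")       # even smaller, trained on tweets

print(small_vectors.most_similar("movie", topn=3))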
Hello everyone, please, I need your help.
I'm looking for a text-mining technique suited to predicting learner engagement from participation in online discussions.
I hope for your help.
Best regards
The concept of business analytics, through SAS and many other software packages and tools, has significant importance in industry, but academic research is still at a nascent stage. A sub-division of business analytics is marketing analytics, where all the concepts of data analytics are viewed in terms of the marketing science paradigm.
I have read three books titled:
- Marketing Analytics Roadmap: Methods, Metrics, and Tools by Jerry Rackley
- Marketing Analytics: A Practical Guide... by Mike Grigsby
- Predictive marketing by Omer Artun
Moreover, the upcoming conferences of the AMA (American Marketing Association) have also highlighted the topic of marketing analytics, along with many other IT-intensive topics. Furthermore, there are a few theses available on ProQuest as well.
Still, the broad concept of marketing analytics does not guarantee that a "problem statement" devised from all this can create a notable 'ripple' in academia.
In conclusion, I want to pursue "marketing analytics" for my PhD and become a professional in the field as well.
So, I kindly ask all researchers and professionals out there to guide me.
Dear all,
Do you know of any available dataset for text summarization, with reference summaries?
I'm looking to analyze reflective observations of Korean early childhood educators. I've looked at IBM Watson Natural Language Understanding, Stanford's Natural Language Processing Group, and Provalis Research text analytics software. Any suggestions?
I am trying to do a topic modeling study of a dataset of about 4 million tweets using Mallet and running into issues with working memory, or "heap space." My computer does have around 15 GB of working memory, but Mallet, by default, utilizes only 1 GB. So I was getting the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
I expanded the Mallet heap space allocation in the manner prescribed on https://programminghistorian.org/lessons/topic-modeling-and-mallet#issues-with-big-data. But it didn't help. So I was wondering if anyone had a solution.
Thanks.
Currently I am working on a big .XML file from which I am retrieving useful data and discarding unnecessary data, but doing so is taking a lot of time.
So, is there any instance where parallel programming (e.g., CUDA) has been used to reduce the time needed to complete such a task, in my case text preprocessing?
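Before reaching for CUDA, a CPU-side combination of streaming parsing and multiprocessing may already help; a minimal sketch (the file name, element tag and field name are hypothetical):

# Stream the XML with iterparse so it never sits fully in memory, and preprocess the
# extracted text fields in parallel with a process pool. 'record', 'description' and
# 'big_file.xml' are placeholders for your own element names and path.
import multiprocessing as mp
import xml.etree.ElementTree as ET

def preprocess(text):
    # placeholder text preprocessing: lowercase and tokenize
    return text.lower().split()

def stream_records(path, tag="record"):
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == tag:
            yield elem.findtext("description") or ""
            elem.clear()   # free memory for elements already handled

if __name__ == "__main__":
    with mp.Pool() as pool:
        results = pool.map(preprocess, stream_records("big_file.xml"), chunksize=1000)
    print(len(results), "records preprocessed")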
I know there are various tools available for parsing and retrieving high-level info such as protocol, TOS, size, src IP/port, dest IP/port, timestamp, etc.
From this available info, what are the different approaches to pre-process this data to make useful info out of it, similar to the KDD'99 dataset?
If I had to lay out the most generic steps of text analytics, what would be the most commonly used steps for any text analysis model?
Any help and your expert guidance/ suggestions are welcome.
Thanks in advance
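For illustration, a sketch of one common instantiation of those generic steps with scikit-learn (the tiny corpus and labels are placeholders): clean/tokenize and remove stop words, vectorize with TF-IDF, fit a model, and evaluate.

# Minimal sketch of a generic text-analytics pipeline; the documents and labels are toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

docs = ["great product, works well", "terrible support, broken on arrival",
        "loved it", "waste of money"]
labels = [1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),  # cleaning + vectorization
    ("clf", LogisticRegression()),                                     # modelling step
])

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, random_state=0, stratify=labels)
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))         # evaluation step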
I have collected DUC 2005 and 2006 to evaluate my query focused multi-document summarization using ROUGE. However, I can not find the queries that were used to generate the reference summaries. Can someone tell me what queries are used? The data can be found here https://www.dropbox.com/s/e42n246x5721zrm/DUC2005.zip?dl=0
I have a finite set of subjects. Now I want to find which subject a tweet of a given Twitter user belongs to, so that I can learn the topics of interest of that Twitter user. Which classifier would be most suitable for tweets, which have a small number of words?
Hi ,
I know that most existing probabilistic and statistical term-weighting schemes (TF-IDF and its variations) are based on an assumption of independence between index terms. On the other hand, semantic information retrieval relies on the dependence between index terms.
I am wondering when dependence between index terms is vital, and when we can neglect it?
Note on the dependence assumption: if two index terms have the same occurrences in a document, this tends to mean that the terms are dependent and should have the same term-weight values.
Thanks
Osman
My problem is to tag sentences by their lexical category. I have sentences of 5 words, each with its 5 tags in front of it. Actually, I need to predict the Y vector of 5 tags for the X vector of 5 words at the same time, i.e. in one row of the dataset.
Is there any alternative approach to this problem? I want to capture the context.
Also, how can I implement it so as to predict the values of 5 labels at a time?
My research is about POS tagging, and I have to build a new corpus because one does not exist for the language. How do I create a new corpus? Should I use plain text saved in a .txt file, or another format?
Thank You
Are there any good packages that will scan and match patterns of characters in different files?
Hi,
I need to classify a collection of documents into predefined subjects. The classification is based on TF-IDF. How can I determine whether unigrams or bigrams or trigrams...or n-grams would be most suited for this? Is there any formal or standard way to determine this?
Also, how to determine the most appropriate number of features I should consider?
Any help would be highly appreciated.
Manjula.
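One standard, if purely empirical, way to decide is cross-validated grid search over the n-gram range and feature count; a hedged sketch follows (load_my_corpus is a hypothetical loader returning texts and subject labels).

# Sketch: choose ngram_range and max_features by cross-validated search over TF-IDF settings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

docs, labels = load_my_corpus()  # hypothetical loader: texts and their predefined subjects

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],   # unigrams, +bigrams, +trigrams
    "tfidf__max_features": [5000, 20000, 50000],      # candidate feature-set sizes
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_macro")
search.fit(docs, labels)
print(search.best_params_, search.best_score_)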
My requirement is analysing user feedback on restaurants and determining whether it is positive or negative about a subject. I am taking the approach outlined below, but I keep reading that NLP may be of use here. All that I have read has pointed at NLP distinguishing opinion from fact, which I don't think would matter much in my case. I'm wondering two things:
1) Why wouldn't my algorithm work and/or how can I improve it? ( I know sarcasm would probably be a pitfall, but again I don't see that occurring much in the type of news we will be getting)
2) How would NLP help, why should I use it?
My algorithmic approach (I have dictionaries of positive, negative, and negation words; a minimal sketch follows the list):
1) Count number of positive and negative words in article
2) If a negation word is found within 2 or 3 words of the positive or negative word (e.g. "NOT the best"), negate the score.
3) Multiply the scores by weights that have been manually assigned to each word. (1.0 to start)
4) Add up the totals for positive and negative to get the sentiment score.
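A minimal sketch of steps 1-4 above (the word lists and weights are illustrative placeholders, not a validated lexicon):

# Lexicon-based scoring with a negation window, matching the four steps described above.
POSITIVE = {"good", "great", "best", "tasty"}
NEGATIVE = {"bad", "awful", "slow", "cold"}
NEGATION = {"not", "never", "no"}
WEIGHTS = {}          # per-word weights, default 1.0 (step 3)

def sentiment_score(text, window=3):
    tokens = text.lower().split()
    pos_total, neg_total = 0.0, 0.0
    for i, tok in enumerate(tokens):
        if tok not in POSITIVE and tok not in NEGATIVE:
            continue
        weight = WEIGHTS.get(tok, 1.0)
        # step 2: flip polarity if a negation word appears within `window` words before it
        negated = any(t in NEGATION for t in tokens[max(0, i - window):i])
        is_positive = (tok in POSITIVE) != negated
        if is_positive:
            pos_total += weight      # step 1: count positive hits, weighted
        else:
            neg_total += weight      # step 1: count negative hits, weighted
    return pos_total - neg_total     # step 4: overall sentiment score

print(sentiment_score("the food was not the best but service was great"))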
We are trying to prototype an automatic life insurance underwriting system and want to know the most promising method. We also realize that a black-box system is problematic if we cannot explain the rules to an auditor.
I started looking at "standard" large datasets, like the GSS, but I can't seem to find any that have open-ended questions.
I am working on keyword extraction problem from text documents. I have implemented the algorithm proposed in "Main Core Retention on Graph-of-words for Single-Document Keyword Extraction". I am not able to reproduce the graph given in the paper. Can somebody help?
The link for the paper is given.
I have used the same preprocessing steps and preprocessing tools as given in the paper. The sliding window size for graph creation is also kept as 4. I have implemented the algorithm in R using igraph, tm, openNLP, and NLP packages.
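Not a fix for the R/igraph code itself, but a minimal Python/networkx sketch of the same sliding-window graph-of-words construction followed by main-core retention, which may help cross-check intermediate results against the paper:

# Build an undirected graph-of-words with a sliding window, then keep the main k-core.
import networkx as nx

def graph_of_words(tokens, window=4):
    g = nx.Graph()
    g.add_nodes_from(tokens)
    for i, w in enumerate(tokens):
        for other in tokens[i + 1:i + window]:   # co-occurrence within the sliding window
            if other != w:
                g.add_edge(w, other)
    return g

tokens = "keyword extraction from a single document using graph of words".split()
g = graph_of_words(tokens)
main_core = nx.k_core(g)                         # main-core retention
print(sorted(main_core.nodes()))                 # candidate keywords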
I am trying to find out whether we should search for and collect specific users based on race and gender and then code their tweets, OR whether, if I collected a specific group of tweets based on a date range, I could then figure out the users' demographics from the metadata.
Thanks for any input.
Good afternoon,
I have to conduct a search of the information/support sources on visual impairment at laypersons' disposal on the Internet (i.e. websites, blogs, Facebook...).
The matter is to know, on the one hand, "what is there" on the Internet and, on the other hand, to analyze the resources found to determine their strengths and shortcomings. The latter is not the problem for me (the literature on this topic is quite extensive); what I do not know is whether there is a rigorous procedure to follow when searching the Internet and selecting the results (e.g. as when a systematic review of the written literature is conducted).
I mean: would it be proper to select, e.g., the first 20 Google search results according to some inclusion-exclusion criteria? Is 20 enough? Is it too few? Where should the limits be?
If you have done something similar, have you followed any methodological guidelines?
Thanks,
Marta
If I have two techniques that handle text detection from images, on what basis can I compare the methods?
Specifically, I would be interested to see whether a keyword can be searched in social media, say Facebook, and all comments can be fetched and analyzed by using text analytic software. Thank you.
Dear all,
How can I add a missing chain ID in a PDB file?
I'm trying to use pdb-mode in Emacs. However, I'm stuck on how to install pdb-mode in Emacs. Below are the instructions, but I don't know how to use them.
To use pdb-mode.el, download it and put it somewhere. Add the following lines to your ~/.emacs file (or get your sysadmin to add it to the site-start.el), fire up (x)emacs and visit a PDB file (with suffix .pdb).
(load-file "/{path-to}/pdb-mode.el")
(setq auto-mode-alist
(cons (cons "pdb$" 'pdb-mode)
auto-mode-alist ) )
(autoload 'pdb-mode "PDB")
I don't know how to put these lines in the .emacs file as mentioned above.
Anybody who is familiar with this, please help me. I really need help; I'm new in this area.
edit: I added my own solution as an answer below. I hope it helps in case you are facing the same problem as I was.
Hi,
I am working with the IAM database for writer identification. This database is very poorly organized. What's strange is that almost every paper in my research area has used it, yet I cannot find a single place where it is organized properly. The database has 657 writers, each writer contributing a different number of samples (ranging from 1 to 4 documents per writer). There is no structure or folders like there is in other well-known databases such as CVL, AHTID, ICFHR, ICDAR, etc.
The official download link just provides a gzip file that has all the images in one folder, not labeled.
I am having a lot of difficulty arranging it. It is honestly the most well-known database and almost everyone has used the full database in their work, yet I do not know how they arranged it. Some papers mention that for writers with over 3 documents, 2 were used for training and 1 for testing, and for writers with one document, it was divided in half, with half used for testing and half for training.
If I go by this structure (which I have to for fair comparison of work) even then it will take me days to manually go over every document and label it. This method will lead to errors.
I'm just asking here in hopes that someone from writer identification research area will read this and help me in acquiring the sorted version of the database.
Also even if I do manage to sort the documents, they must be segmented to divide the typed text from the handwritten one.
Why is this database so popular? Why am I the only one having this much difficulty with it?
Thanks for reading
I am a novice in text analysis. I need to establish a hierarchy between words present in a document. Each line of the document contains, on average, 5 words.
E.g:
dogs with cute face
siberian husky
cocker spaniel dog
cute puppy
...
Now, I want to create a hierarchy that will state something like this:
dog->breed->cocker spaniel
dog->youngling->puppy
I came across "ontology" but since my data size is quite large, establishing all kinds of relationships is quite cumbersome. I was wondering, instead, if I could simply create a hierarchy of such concepts. Is it possible? Are there any existing tools for the same?
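One lightweight option, assuming an is-a hierarchy is enough: WordNet's hypernym chains via NLTK already encode relations like puppy -> dog -> canine without hand-building a full ontology. A small sketch (assumes the 'wordnet' corpus has been downloaded with nltk.download('wordnet')):

# Walk one hypernym path per word to get an is-a chain from the WordNet root down to the word.
from nltk.corpus import wordnet as wn

def hypernym_chain(word):
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return []
    # take the first sense and its first hypernym path (root ... word)
    path = synsets[0].hypernym_paths()[0]
    return [s.lemmas()[0].name() for s in path]

print(" -> ".join(hypernym_chain("puppy")))
print(" -> ".join(hypernym_chain("spaniel")))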
Is there a framework that allows adding more functionality and implementing those modified algorithms as a real recommender?
I'm looking for a dataset that contains transactional data (i.e. it must contain different cases identified by unique IDs that appear with several actions and timestamps throughout the log) as well as free text.
Let me give you an example of what this could look like: incident service management, where (free-text) complaints can be submitted and then processed by several resources until resolved.
Is anyone aware of such dataset? Thanks in advance,
Tim
Having recently started a text mining project, I have been struggling with an R package called 'sentiment' while performing a sentiment analysis. The package is only available in the archives of CRAN, seems outdated and was not compatible with the most recent version of R (on my computer). Does anyone know an alternative or even better R package for sentiment analysis?
What are the methods that address a problem like flow or flow-violation of sentiment across a series of sentences?
Given a customer review R comprising sentences {s1, s2, ..., sk};
each sentence si has a set of features Xi = {x1, x2, ...., xn}i ;
We fit a binary classifier which takes {X1, X2, ...., Xk} as input and gives sentiment polarity labels {Y1......Yk} as output.
There are ambiguous sentences which the classifier is not able to correctly classify. To improve this, we assume that there should be a flow of sentiment within a review, which is broken only under specific conditions.
E.g., a consumer may feel very positive about a laptop, except some of its aspects. So, sentence-after-sentence they say positive things (flow) but then one sentence is negative (flow-violation), marked by either a contrast term like 'but', 'however' etc. or by a strong negative term or both.
One way to check this flow or flow-violation is to use an autoregression-like model.
So, we predict Y values using the classifier. Then, we correct each Yi using a weighted sum of previously predicted labels and the current value of Yi. The estimation of the weights is a separate problem.
However, I want to directly use a sequence of features Xi, Xi-1, Xi-2... in a classifier so it can learn the dependence on previous sentences itself. The problem is that this approach becomes computationally complex due to the feature size. Also, manually tuning an autoregression-like function gives better results.
What are other methods that address a problem like flow or flow-violation of sentiment across a series of sentences?
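A small numpy sketch of the autoregression-like correction described above, smoothing each predicted label with a weighted sum of the previous predictions (the weights here are illustrative and would normally be estimated):

# Smooth per-sentence polarity predictions with a weighted sum of the current and previous labels.
import numpy as np

def smooth_labels(y_pred, weights=(0.6, 0.25, 0.15), threshold=0.5):
    """y_pred: per-sentence polarity scores in [0, 1], in review order."""
    y = np.asarray(y_pred, dtype=float)
    smoothed = y.copy()
    for i in range(len(y)):
        # current prediction first, then up to len(weights)-1 previous predictions
        context = [y[i - lag] for lag in range(len(weights)) if i - lag >= 0]
        ws = weights[:len(context)]
        smoothed[i] = np.dot(ws, context) / sum(ws)
    return (smoothed >= threshold).astype(int)

print(smooth_labels([1, 1, 0.55, 0.4, 1]))   # borderline scores get pulled toward the surrounding flow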
Hi - I need some open-source tools for complex text semantic analysis and coreference resolution.
OpenNLP fails to perform coreference resolution if the text is long. What algorithms are used for coreference resolution?
Thanks
Is there any research work where a text recognition process is carried out to extract text in vertical alignment?
Text similarity is a key point in text summarization, and there are many measures that can calculate similarity. Some of them are used by most researchers, but I haven't found a strong justification for why exactly those and not others. Is there any strong justification for using cosine similarity, the Jaccard coefficient, Euclidean distance and the Tanimoto coefficient as measures of text similarity in text summarization approaches?
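There may be no single authoritative justification, but a small sketch can at least make the behavioural differences concrete: cosine similarity ignores document length, Euclidean distance does not, and Jaccard looks only at set overlap (the two sentences below are placeholders).

# Compare two sentences under three of the measures mentioned above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

a = "the summary covers the main points of the article"
b = "the article main points are covered in the summary summary summary"

X = CountVectorizer().fit_transform([a, b]).toarray()
print("cosine   :", cosine_similarity(X)[0, 1])
print("euclidean:", euclidean_distances(X)[0, 1])

set_a, set_b = set(a.split()), set(b.split())
print("jaccard  :", len(set_a & set_b) / len(set_a | set_b))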
The problem is that the labels do not appear explicitly in the documents: each could be mentioned at most once in a document, or only be referred to indirectly. It is neither a topic modeling nor a clustering problem. Please advise which algorithm or tool should be used to label documents automatically.
Would anyone have a good tool or application (preferably open source) to recommend for comparing similarities and differences between two texts? In an ideal world I would be able to attach a code to each major difference or similarity and perhaps quantify them to some extent.
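One built-in, open-source option worth trying (a sketch rather than a full solution): Python's difflib lists the matching blocks and the insert/delete/replace operations between two texts, which could then be tagged with your own codes and counted.

# Word-level comparison of two texts: overall similarity ratio plus the individual edits.
import difflib

text1 = "The committee approved the budget after a short debate."
text2 = "The committee rejected the budget after a long and heated debate."

matcher = difflib.SequenceMatcher(None, text1.split(), text2.split())
print("similarity ratio:", round(matcher.ratio(), 2))
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(tag, text1.split()[i1:i2], "->", text2.split()[j1:j2])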
How can we interpret the results of Singular Value Decomposition, which forms the foundation of Latent Semantic Analysis, when it is performed on a term-document matrix to determine patterns in the relationships between the terms and concepts contained in the text?
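A hedged sketch of how the SVD factors are usually read in LSA: applying truncated SVD to a TF-IDF term-document matrix yields latent concepts; the component loadings link terms to concepts, and the transformed matrix places documents in the same concept space (the toy corpus is a placeholder).

# Truncated SVD (LSA) on a small TF-IDF matrix: inspect term loadings and document coordinates.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats and dogs are pets", "dogs chase cats",
        "stocks and bonds are investments", "investors buy stocks"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)            # documents x terms
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)             # document coordinates in the latent concept space

terms = vectorizer.get_feature_names_out()
for k, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:3]       # highest-loading terms characterise the concept
    print(f"concept {k}:", [terms[i] for i in top])
print("document-concept matrix:")
print(doc_topics.round(2))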
The corpus must contain documents (texts) with keywords hand-annotated by human experts.
Extracting causal relationships from texts is far from trivial, but there are quite a few intriguing pieces in the recent literature that discuss how this could be done. E.g. http://www.hindawi.com/journals/tswj/2014/650147/. The 'technology readiness level' of this work seems significantly below that of things like entity, sentiment, event, etc extraction. But at least some progress seems to have been made.
Given the availability of so many large full-text academic databases, it would of course be fantastic to be able to 'extract' all of the causal hypotheses that have been formulated over the years in various disciplines. So does anybody know of any existing text-mining tools that can already do this - even if it's just for English?
My research examines comparison of politeness strategies used by two groups of students.
I am aware of the higher-level architecture. But I am really curious to know how the knowledge is represented, stored and retrieved. Is it a simple ontology, even for an open-domain QA problem?
There are a lot of text mining approaches for grouping or clustering text (k-means, KNN, LDA...). In my case, I have a set of short texts (10 to 50 words) containing chemical formulas and numbers (as results of experiments).
I applied cosine similarity to 3,000 text files, so I now have similarity scores. What sort of analysis can be performed on them?
I mean, I have different floating-point scores showing the similarity between the texts of the files. What can be achieved from this?
We are thinking through some of the problems in distinguishing reputable news from phony ones. What might give you a clue that an online story you are reading is bogus, fake, or unreliable? We'd appreciate examples of what appears to be a reliable news source and what doesn't. Worldwide. Any language.
Thanks so much! VR
I want to do a very simple job: given a string containing pronouns, I want to resolve them.
For example, I want to turn the sentence "Mary has a little lamb. She is cute." into "Mary has a little lamb. Mary is cute.".
I use Java and Stanford Coreference, which is a part of Stanford CoreNLP. I have managed to write some of the code, but I am unable to complete it and finish the job. Below is some of the code I have used. Any help and advice will be appreciated.
// Cleaned-up version of the snippet, assuming Stanford CoreNLP 3.x with the 'dcoref' annotator
// (class names may differ slightly between CoreNLP versions).
import java.util.List;
import java.util.Map;
import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefChain.CorefMention;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;

String text = "Mary has a little lamb. She is cute.";
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
pipeline.annotate(document);
// Sentences (with token indices) are needed later to substitute mentions back into the text.
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
// Coreference chains are stored at the document level, so retrieve them once, not per sentence.
Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
    CorefChain chain = entry.getValue();
    // The representative mention (e.g. "Mary") is the string to substitute for the other mentions ("She").
    CorefMention representative = chain.getRepresentativeMention();
    System.out.println(chain);
    System.out.println(representative);
    // To finish the job: for every non-representative mention in chain.getMentionsInTextualOrder(),
    // use its sentNum and startIndex/endIndex to replace those tokens in 'sentences'
    // with representative.mentionSpan, then rebuild the text.
}
Dear all,
I am looking for unsupervised approaches to identify the category of each tag; in other words, approaches that use taxonomies or thesauri to categorize tags.
The goal is to classify a list of generated tags into a set of different categories. Most of the generated tags are extracted from audio and video content.
The question: which analytical techniques are best suited for which types of problems and data sets? Many techniques are being proposed - how does one select the right technique?
I am currently working on customer feedback and wish to classify text feedback on the basis of tone of expression, or tone more broadly.
There are many tools used for named entity tagging, such as Stanford CoreNLP. What is the most common named entity tagger with a low error rate for British English?
I have got samples of some civil engineering projects (some floor plans, elevations). I consider them to be texts with professional civil engineering images. I have defined some typical grammar structures and words typical for notes. How can I analyze them from another viewpoint?
To perform aspect-based opinion mining, we first need to extract aspects, or topics, for a document (in this case short texts like online reviews/tweets). Will techniques like LDA or multi-grain LDA give adequate performance for this kind of topic extraction?
, i.e., finding the existence and quantity of a set of adjectives from a given set of sentences where the sentences do not contain the adjectives?
Is it important to use word frequency analysis during searching using keywords? If so, any recommended method?
I would like code to run the Stanford Named Entity Recognizer (NER). Suppose that I have a text and I would like Stanford NER to recognize the entities mentioned in the text.
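One option is NLTK's wrapper around the Stanford NER CRF tagger; a minimal sketch follows (the model and jar paths are placeholders that must point to your local copies from the Stanford NER download, and Java must be installed).

# Tag entities in a text with the pre-trained Stanford 3-class model via the NLTK wrapper.
from nltk.tag.stanford import StanfordNERTagger

tagger = StanfordNERTagger(
    "english.all.3class.distsim.crf.ser.gz",   # pre-trained model: PERSON / LOCATION / ORGANIZATION
    "stanford-ner.jar",                        # path to the Stanford NER jar
)
text = "Barack Obama was born in Hawaii and worked in Washington."
print(tagger.tag(text.split()))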
I am doing my dissertation in sentiment analysis. I am combining the sentiment classification of sentence opinions, star-rating opinions and emoticon opinions, and I am using the RapidMiner tool to classify the opinions. Please help or guide me on how to classify the star-rating opinions and emoticon opinions. How can I do that?
Please help me.
I would like to know how parse trees for a particular text are generated. Is there any algorithm for that?
I am trying to apply TextRank to a document and would like to know if there are any existing tools or APIs available.
Please guide me.
Cohesive devices include reference, ellipsis, substitution, conjunctions and lexical reiteration.
Big data analytics is the process of examining large data sets containing a variety of data types -- i.e., big data -- to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. The analytical findings can lead to more effective marketing, new revenue opportunities, better customer service, improved operational efficiency, competitive advantages over rival organizations and other business benefits (see first link).
Big data can be analyzed with the software tools commonly used as part of advanced analytics disciplines such as predictive analytics, data mining, text analytics and statistical analysis. Mainstream BI software and data visualization tools can also play a role in the analysis process.
What are the trends and best practices of big data analytics in business and industry ? Your views are welcome!
I want to find the shortest path length and path depth between any two words using Wikipedia or Wiktionary. Can anyone help me in this regard?
Note: So far I have experimented with an untrained TreeTagger, but (unsurprisingly) only with mediocre results :-/ Any hints on existing training data are also appreciated
The results so far can be viewed here: http://dh.wappdesign.net/post/583 (lemmatized version is displayed in the second text column)
Is there any tool, methodology or algorithm for extracting certain occurrences of a text pattern in a document?
If you have related work on this topic, please share it.
Our large SMS corpus in French (88milSMS) is available. User conditions and downloads can be accessed here: http://88milsms.huma-num.fr/
Is there a website that list all corpora available for NLP and text-mining communities?
I want to evaluate the results of created summaries.
I'm undertaking a text analysis of official documents. My goal is to do a word count of key terms in dozens of PDF files.
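A small sketch of one way to do this, assuming the pypdf package (the folder name and key terms are placeholders): extract the text of each PDF and count occurrences of the key terms.

# Count key-term frequencies across all PDFs in a folder.
from collections import Counter
from pathlib import Path
import re

from pypdf import PdfReader

KEY_TERMS = ["transparency", "accountability", "governance"]   # placeholder terms

counts = Counter()
for pdf_path in Path("documents").glob("*.pdf"):
    text = " ".join((page.extract_text() or "") for page in PdfReader(pdf_path).pages)
    tokens = re.findall(r"[a-z']+", text.lower())
    for term in KEY_TERMS:
        counts[term] += tokens.count(term)

print(counts)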
I am wondering if there is a research paper that considers the ratio of unstructured text on the web and whether it is the cause of the rapid increase in data on the web. Which data source is responsible for the rapid increase in web data? Is it unstructured data (text)? Is there a research paper discussing this issue?
Thank you very much.
Please answer with an example.
I need clauses or phrases from a sentence.
To perform automated analysis of parallel translations of the same works.
Is there a (preferably open-source) tool available that generates co-occurrence tables for n-grams? I.e., one that can tell you which n-grams a bigram like "water security" tends to co-occur with within a certain (user-defined) 'window' - say within 2 sentences before or after its occurrence.
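A dedicated off-the-shelf tool may or may not exist, but here is a hedged sketch of the underlying computation (with a token window; a sentence window would work the same way on sentence-split text):

# Count which tokens fall within a fixed window around each occurrence of a target bigram.
from collections import Counter

def window_cooccurrence(tokens, target=("water", "security"), window=10):
    counts = Counter()
    n = len(target)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + n:i + n + window]
            counts.update(left + right)
    return counts

text = ("climate change threatens water security in arid regions where "
        "water security depends on groundwater management")
print(window_cooccurrence(text.split()).most_common(5))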
We have been using Zotero to download a corpus of periodical articles (and their bibliographical references) on a number of topics that we are working on - in our case mostly from EBSCO 'Academic Search Complete'. The Zotero translator in Firefox allows us to do a detailed (full-text) search on that database and to then download both the bibliographical references AND (wherever available) also the actual texts (in pdf format) into a Zotero database (which actually has a mysql database underneath it). We are now looking for ways to textmine that database. The idea would be to find a way to import the corpus into some textmining tool with the bibliographical reference fields as meta-tags. If anybody has ever done something like this, we would love to share experiences!