Text Mining - Science topic
Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
Questions related to Text Mining
I am working on my research dissertation and want to use LDA (Latent Dirichlet Allocation) on my data. I came across the Orange Data Mining program (available here: https://orangedatamining.com/).
Does anyone know how to do LDA correctly in this program?
And for anyone who does: which results of an LDA analysis have to be reported?
Any advice on LDA (how to do it in R, for example, or in Orange) would be very helpful. I am a beginner with this method, but I really want to use it because it is the best fit for my research question.
I am doing LDA (Latent Dirichlet Allocation) in R. I have two questions.
I am analysing comments from Slovenian social media. In an Excel document, I have the author of each comment in one column and the content of the comment in the other, so roughly 4,000 rows with two columns each. The comments originally sat in separate documents, one per source (e.g. Facebook posts, comments under articles, Reddit threads), and I combined them all into this single two-column file, so all comments are now in one column. Can I use LDA in R on this data set, or do the comment groups need to be kept as separate documents for the LDA method? I hope my question is clear; thank you so much.
How do you add Slovenian stopwords in R? I get the following error: Error in stopwords("slovenian"): no stopwords available for 'slovenian'.
I would be happy if someone could help.
Best regards, N. A.
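On the stopword error: the default stopword source in R does not include Slovenian, but the "stopwords-iso" source of the R stopwords package reportedly does (something like stopwords("sl", source = "stopwords-iso") should work there). A language-agnostic fallback is to supply your own list and filter tokens yourself; a minimal Python sketch (the Slovenian words below are a tiny illustrative sample, not a complete list):

```python
# Tiny illustrative sample of Slovenian stopwords -- in practice, load a
# published list such as the one from the stopwords-iso project.
slovenian_stopwords = {"in", "je", "na", "se", "da", "ki", "za"}

def remove_stopwords(text, stopwords):
    """Lowercase, split on whitespace, and drop stopword tokens."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords("To je komentar na spletu", slovenian_stopwords))
# → ['to', 'komentar', 'spletu']
```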
Hi, could you please point me to resources (websites, books, or tutorial videos) on conducting Latent Semantic Analysis through text mining for business research, so I can apply this method to my research project? Thanks in advance. Kind regards, Bushra Aziz
I need some suggestions: I am looking for a health insurance dataset on which text mining is possible. Any recent papers describing such datasets would be helpful.
I have a data set containing a free-text field for roughly 3,000 records, all of which are doctors' notes. I need to extract specific information from each of them, for example the doctor's final decision and the classification of the patient. What is the most appropriate way to analyse these texts: information retrieval, information extraction, or a question-answering system?
For unsupervised text clustering, the key ingredient is the initial embedding of each text.
If we want to use https://github.com/facebookresearch/deepcluster for text, the problem is how to get the initial embedding from a deep model.
BERT does not give good initial embeddings out of the box.
If we do not use a deep model, is there a better way to get embeddings than GloVe word vectors?
Thank you very much.
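One non-deep baseline that often works better than averaged GloVe vectors for initial clustering embeddings is TF-IDF followed by truncated SVD (i.e. LSA). A minimal sketch with an invented toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

texts = [
    "neural networks learn image features",
    "deep neural networks process images",
    "stock markets fell sharply today",
    "investors watch stock markets closely",
]

# TF-IDF + truncated SVD gives dense, low-dimensional document embeddings
# without any neural model; for short texts it is a strong baseline.
tfidf = TfidfVectorizer().fit_transform(texts)
embeddings = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Cluster the resulting embeddings as you would any initial representation.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)
```

For a deep alternative, sentence-level encoders (e.g. sentence-transformers models) tend to give much better initial embeddings than raw BERT token averages, though that is outside this stdlib-free sketch.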
I would like to build an extractive text summarization dataset by crawling webpages; however, I can't annotate (summarize) it manually. Do you know of any way to generate the summaries?
I have some Key Informant Interview (KII) data. I want to apply Natural Language Processing (NLP) to identify patterns in the data. Can applying NLP to analyse KIIs be described as a data analytics tool in the report/paper? Thanks in advance.
I have a research-related question: how can I easily read my results off a co-occurrence network in VOSviewer? Please provide links to any articles I can refer to.
Where can I find a reliable, low-cost source of historical newspapers? I have already tried the New York Times archive, but I can't download the newspapers, and I need them for a text mining project. Any suggestions are welcome!
I am looking for a software for searching a text in a set of files. Any recommendations?
It should be something similar to the Multi Text Finder.
The aim is to teach students to find important information in documents.
I'm doing topic modelling on a collection of technical documents related to device repair. The reports are extracted from different software systems used by different repair shops. I need to do proper cleaning so the model focuses on the key words; specifically, I want to automatically remove useless template words like:
* Additional findings
* External appearance
* Incoming condition, etc
These "fill-in/template words" appear in almost every document, and there are many more of them; the documents are collected from different sources and consolidated into one database from which I do the extractions. I have already tried segregating by repair shop using TF-IDF, term frequency, and BM25, and segregating by software.
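One simple option that avoids maintaining a manual list is a document-frequency cut-off: terms that appear in nearly every report are template words almost by definition. A sketch with invented toy reports (scikit-learn's max_df parameter does the filtering):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

reports = [
    "additional findings none external appearance good screen broken",
    "additional findings scratches external appearance fair battery swollen",
    "additional findings dents external appearance poor keyboard faulty",
]

# max_df=0.9 drops any term appearing in more than 90% of documents.
# Template headers like "additional findings" occur in nearly every report
# and are removed automatically, with no hand-maintained stopword list.
vec = TfidfVectorizer(max_df=0.9)
vec.fit(reports)
print(sorted(vec.vocabulary_))  # the template words are gone
```

The same idea works per source: compute document frequency within each repair shop's subset and drop terms that are near-universal in any one source.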
Well, I'm seeking to find "stop words" for my text mining project in the R programming language. My project aims to find the most appropriate keywords for my thesis's systematic literature review, which focuses on digital transformation. Can anyone help me with this? Note that I don't mean stop-words in the usual sense of very high-frequency grammatical words; rather, I want to create a list of over-indexed words in the digital transformation field.
My master's research is in information retrieval and text mining. I would be grateful if you could help me select a good topic for my PhD research proposal.
I am looking for free software for text mining and sentiment analysis for my research on customer review mining (it involves calculating the polarity of attributes, opinion-oriented information extraction, etc.).
Can somebody tell me whether this can be done with NVivo, and whether NVivo is free?
Any other suggestions are also welcome.
I want to do some text mining of tweets. One of the questions is to understand people's expressions of sympathy/empathy. Is there any way to do this quantitatively?
Specifically, are there any lexicon dictionaries? For example, for moral foundations theory there is a dictionary for detection, and for sentiment analysis there are also many lexicons and packages.
Or, are there any pre-trained models or classifiers that can achieve this job?
Thanks in advance.
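If a sympathy/empathy lexicon is available, the basic mechanics are the same as for any lexicon-based approach: count lexicon hits per tweet and normalize by length. A minimal sketch (the word list below is an invented placeholder, not a validated lexicon; substitute a published empathy or sympathy lexicon):

```python
# Invented placeholder lexicon -- replace with a validated sympathy lexicon.
sympathy_lexicon = {"sorry", "condolences", "thoughts", "prayers", "heartbreaking"}

def sympathy_score(tweet):
    """Fraction of tokens that match the lexicon (length-normalized)."""
    tokens = tweet.lower().split()
    hits = sum(1 for t in tokens if t.strip(".,!?") in sympathy_lexicon)
    return hits / max(len(tokens), 1)

print(sympathy_score("So sorry for your loss, thoughts and prayers"))  # → 0.375
print(sympathy_score("Great weather today"))                           # → 0.0
```

A pre-trained classifier fine-tuned on an empathy-annotated corpus would likely outperform pure lexicon matching, but the lexicon score is a transparent baseline and easy to report.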
We are currently working on a research project that aims to understand consumer behaviour in the cultural sector in Quebec, Canada during the COVID-19 crisis. For this reason, we are looking for tools for text mining and multi-language sentiment analysis (English and French) to analyze opinions on social media. We would prefer cloud-based tools so that our students, who have limited resources and may not have a background in IT, can perform the analysis.
We would appreciate if you could help us to choose the right tool.
Thank you in advance,
1. Can I use Orange for text mining in a qualitative research publication, for interview responses?
2. Is it an acceptable methodology?
3. Can you please refer me to any published, reputable work that used Orange?
I am involved in a project on visualizing misclassification in the text mining domain. I am wondering whether anyone has experience in formally proving that such visualisations actually aid the overall data science project outcome.
I want to make an adjacency matrix with citations.
I want to make an index of 130 words and search 130 papers against those 130 words. Doing this manually is a long process, so I want to automate the searching.
Can anyone suggest if this can be done with text mining or any other ways?
I am working on the answers of stakeholders in the freight transport area and the development of crowd logistics solutions. I need to implement text mining. Do you know of any free software other than R for text mining?
I have a document search engine, and users can rate the search results for any query they make. In the first version of the search engine, I use the Universal Sentence Encoder to generate document embeddings; at search time, user queries are embedded as well, and the documents with the closest embeddings are returned as results.
Users can rate a document from the search results on some scale, say 0 to 5 (0 being not relevant and 5 very relevant).
Is there a way to use this kind of feedback to fine-tune the search results?
One idea is using BERT with a triplet loss, where:
Anchor: the user's search query
Negative: a document the user found not relevant
Positive: a document the user found very relevant
Does anybody have experience doing this? Any other ideas, suggestions, or papers are welcome.
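For reference, the triplet objective itself is simple: pull the relevant document's embedding toward the query and push the non-relevant one away by at least a margin. A minimal NumPy sketch of the loss (the 2-D embeddings are invented; libraries such as sentence-transformers provide a ready-made TripletLoss for actually fine-tuning an encoder, e.g. mapping ratings >= 4 to positives and <= 1 to negatives):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss: max(d(a, p) - d(a, n) + margin, 0)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

query = np.array([1.0, 0.0])
relevant = np.array([0.9, 0.1])      # rated 5 by the user
not_relevant = np.array([0.0, 1.0])  # rated 0 by the user

# Loss is 0 when the relevant doc is already closer by at least the margin.
print(triplet_loss(query, relevant, not_relevant))  # → 0.0
```

During fine-tuning, the encoder's weights are updated so that this loss shrinks over the collected (query, positive, negative) triplets; the middle ratings (2-3) are often simply discarded.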
Can anyone make a simple example based on a small database? I need to compute by hand to understand it.
I have attached an example. Please explain it with more details for me.
Thanks a lot
I am interested in a software tool for extracting data from social networks based on geographic characteristics. Purpose - analysis to obtain data on the mood of the population.
I am trying to do text mining on Chinese reviews. I have tried many software packages, such as RapidMiner, Chinese Text Analytics, and Python. Most of them seem to require a certain level of programming knowledge. RapidMiner requires the Hanminer extension, but I don't know why it is still not working. I found LIWC, which seems able to analyze Chinese text, and I purchased the software; but now I have difficulty segmenting the text using the Stanford Segmenter, which again requires some programming work. Any recommendations on how I can do this, or on an easier way of analyzing Chinese reviews? Many thanks!
I am planning to use text mining as a method to collect data from social media. Do you know of any key literature that explains the method?
I want to know about current research trends in machine learning and natural language processing (NLP) for code-mixed text, in detail and as soon as possible. This is for a (theoretical) computer science research project. Thanks in advance.
I want to check whether my model is overfitting. My study focuses on unstructured tweets. I labelled the tweets with TextBlob and used LinearSVC for classification; the model accuracy is 98%. Now I suspect overfitting. Is such high accuracy normal, or what might be my mistake?
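Two things are worth separating here. First, because the labels come from TextBlob (a lexicon-based tool), a classifier trained on them can score very high simply by re-learning the lexicon; that is label circularity, not human-level sentiment accuracy. Second, overfitting itself is best checked with cross-validation rather than a single split. A sketch with an invented toy corpus (note the vectorizer sits inside the pipeline, so each test fold stays unseen during fitting):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = ["love this phone", "great battery life", "terrible screen",
         "awful support", "really love it", "battery is great",
         "screen is terrible", "support was awful"]
labels = [1, 1, 0, 0, 1, 1, 0, 0]

# Vectorizing inside the pipeline prevents vocabulary leakage from the
# test folds; a large gap between train and CV accuracy signals overfitting.
pipe = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(pipe, texts, labels, cv=4)
print(scores.mean())
```

If cross-validated accuracy stays near 98%, the model generalizes to the TextBlob labels; to claim accuracy on real sentiment, a human-annotated evaluation set is still needed.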
I am currently working on grading automation for open questions using machine learning algorithms, natural language processing, and text mining for university assessment purposes. My main focus is the influence of question types on the performance of the algorithms and their relation (for instance, how to design questions to improve understandability for students, which in turn results in clearer answers for automatic grading). I am looking for available literature focusing on question types. Does anyone have any suggestions?
I have the following situation: I have a paper X about topic Y. For paper X I did a forward search with Web of Science (checking all new papers which cite paper X). Then I have downloaded all articles I have identified via forward search (approx. 1'000 Papers). Now I would like to sort these papers according to the frequency of specific keywords used.
For example: I have found paper Z via forward search (so paper Z cites paper X, which is about topic Y). Now I want to check whether paper Z is also concerned with topic Y or just refers to it in passing. For that, I search for specific keywords corresponding to topic Y. According to the frequency of these keywords in paper Z, I want to classify it as "relevant" or "not relevant". Now, how can I determine the threshold for the keywords? That is, if paper Z uses a specific keyword only once, it is most probably not relevant to topic Y; but if it mentions the keyword 20 times, it is probably relevant.
Is there a recognized methodology to determine or approximate a threshold for the keyword frequency which allows to distinguish if a paper is relevant to topic Y or not?
With this approach I hope to reduce the 1'000 papers to those which are about topic Y.
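One practical refinement before choosing any threshold: use keyword occurrences per 1,000 tokens rather than raw counts, since a raw-count threshold favours long papers. A sketch with invented keywords and toy "papers"; a defensible way to set the cut-off is then to rank all 1,000 papers by this rate and manually inspect a sample around candidate thresholds:

```python
def keyword_rate(text, keywords):
    """Occurrences of any keyword per 1,000 tokens, so long and short
    papers are directly comparable."""
    tokens = text.lower().split()
    hits = sum(tokens.count(k) for k in keywords)
    return 1000 * hits / max(len(tokens), 1)

keywords = {"governance", "stewardship"}  # placeholder keywords for topic Y

paper_a = "data governance and governance frameworks " * 50  # on-topic
paper_b = "unrelated methods section " * 100                 # no mention

print(keyword_rate(paper_a, keywords))  # high rate → candidate "relevant"
print(keyword_rate(paper_b, keywords))  # zero → "not relevant"
```

Ranking plus manual calibration on a sample is more defensible in a methods section than a threshold picked in advance, because the cut-off is validated against actual papers.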
I want to give my master's students research topics related to text mining, especially regarding language processing. Can anyone guide me in this regard?
I need your help regarding Artificial Intelligence in the context of information retrieval tools, big data, and data mining in libraries. Could you share any dissertations/theses, research papers, conference papers, book chapters, research projects, or articles? I would also welcome your comments, thoughts, and feedback on how university libraries can support me in designing my PhD questionnaire.
I would like to find an efficient way to perform text mining methods and topic modeling on scientific publications. So far I have not found a solution for making the texts available for processing in RStudio. Is there an easy way to build a corpus from large numbers of text documents, e.g. .pdf files? Or is there even an R package of some sort that allows getting the texts directly from databases like Web of Science?
Any help, advice, tips, tricks and hints are highly appreciated. Thank you very much in advance!
Queries to search engines (such as Google) contain some information about the intent and interests of the user, which can be used for user profiling, recommendation, etc. As far as I know, there are already many methods for dealing with relatively long texts, such as news, articles, and essays, and for extracting useful features from them. However, queries are usually very short and may relate to many different areas. I wonder whether there are advanced methods (not simple word embeddings) already verified to be effective at extracting information from query texts? Thanks!
I'm currently confused by what you are discussing; you would need to be more specific and concise when talking about text mining and opinion mining.
I am preparing a paper on a bibliometric study of nursing informatics, and I am interested in similar studies, published or unpublished.
What is contextual and non-contextual feature selection, and how is contextual feature selection used in text mining?
I am thinking of creating a search engine to help people find a movie (or similar movies) based on snippets of the story. For instance, if a user types in "movie about a dog waiting a long time for his owners to come back", the results should include "Hachiko", "Eight Below", "Lassie", etc. However, it would be better if we could use a data mining method to search based on the actual plot of the movie, not just keywords. What is the best solution for this?
Hi respected fellows, please help me collect data in an ethically sound way for text analysis. I intend to collect Google reviews of the Google Home for text mining, to extract factors. Please help me identify a method for collecting customer reviews for analysis.
Hi there! I'm looking for a tool that can calculate a similarity score between each pair of terms from two lists of names and return the top 10 scores it finds. For example:
List 1: (-)-epigallocatechin, (+)-catechin, (pyro)catechol sulfate, 3',4'-Dimethoxyphenylacetic acid
List 2: 3',3'-Dimethoxy-phenylacetic acid, catechin, (epi)gallocatechin, catechol sulfate
Expected results: (-)-epigallocatechin vs (epi)gallocatechin, score = 0.9 (very similar); (-)-epigallocatechin vs (+)-catechin, score = 0.5; (+)-catechin vs catechin, score = 0.9; etc. Thanks a lot for your great help.
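Python's standard library can do exactly this with difflib; a sketch using the lists from the question (for large lists, a faster fuzzy-matching library such as rapidfuzz offers the same ratio-style scores):

```python
from difflib import SequenceMatcher
from itertools import product

list1 = ["(-)-epigallocatechin", "(+)-catechin", "(pyro)catechol sulfate",
         "3',4'-Dimethoxyphenylacetic acid"]
list2 = ["3',3'-Dimethoxy-phenylacetic acid", "catechin",
         "(epi)gallocatechin", "catechol sulfate"]

def similarity(a, b):
    # Character-level similarity ratio in [0, 1], case-insensitive.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Score every pair across the two lists and keep the 10 best matches.
pairs = sorted(((similarity(a, b), a, b) for a, b in product(list1, list2)),
               reverse=True)[:10]
for score, a, b in pairs:
    print(f"{score:.2f}  {a}  <->  {b}")
```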
Do NLP and NVivo do the same operation, or how do they differ from each other? The attached file describes using NLP for construction site accident analysis; can we do the same using NVivo?
As a developer, how can I access the Microsoft Web N-gram Service during development?
I am working on several papers targeting organizational culture, corporate values, and leadership. Traditionally, those topics have been researched using either questionnaires or interviews. The limited number of cases covered, as well as the often missing link to the companies (disclosure), motivated me to explore text mining and NLP as tools for culture research. I wonder what others think about this, and whether some of you have experience with it.
Otherwise, if of interest, I am of course happy to share my knowledge and some drafts of my published work in this field.
Most of the proposed algorithms concentrate on neighboring concepts (events), like "enter restaurant" --> "wait for waiter", but I have trouble finding papers on generating / retrieving longer scripts (I am not talking about narrative cloze task) which are evaluated for commonness.
What are the available benchmarks for evaluating semantic textual similarity approaches?
I am aware of the following:
- SemEval STS
- Microsoft Research Paraphrase Corpus
- Quora Question Pairs
Do you use other that these in your research?
Dear colleagues, I would like to generate a summary of all packages in R which can be used for big data research (data mining, web crawling, machine learning, text mining, social media analysis, neural networks, you name it).
It would be fantastic, if we can create a huge list
a) of the name of packages
b) a short summary what the packages does
c) references to tutorial (beyond the standard CRAN description).
I would like to ask:
1. "What are the different approaches available to find character-based, word-based, or line-based similarities and differences among multiple text documents?"
2. "Is there any open-source library or source code available that can help identify word-, character-, or line-based similarities and differences among multiple text documents?" The required library should not only give me the similar strings, but also their exact locations.
Please let me know about it, I would be thankful to you.
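Python's standard difflib covers all three granularities and also reports locations: run SequenceMatcher over lists of lines for line-level comparison, over token lists for word-level, or over the raw strings for character-level. A sketch with two invented three-line documents:

```python
from difflib import SequenceMatcher

doc_a = ["The quick brown fox", "jumps over the lazy dog", "and runs away"]
doc_b = ["The quick brown fox", "walks past the lazy dog", "and runs away"]

# get_opcodes() reports equal/replace/insert/delete spans WITH their exact
# line indices in both documents -- i.e. the "exact location" asked for.
sm = SequenceMatcher(None, doc_a, doc_b)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    print(tag, f"doc_a[{i1}:{i2}]", f"doc_b[{j1}:{j2}]")
```

Passing the strings themselves (instead of line lists) gives character positions, and passing text.split() gives word positions, with the same opcode interface.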
I tend to reach out when I'm fairly clueless about something, this time no exception. Some background first though.
I research in Thailand and Australia, Thailand never being easy. Business managers tend not to be helpful. Why say yes and create risk, so possible loss of face (Thailand is a heavily face-based culture) when you can ignore or say "No"? Sometimes, though, the situation doesn't pan out that way at all. Currently, I have a number of business managers happy to help, saying "Yes". But there's an issue - it's low (green) season. There's a real paucity of clients at cookery schools for me to interview.
I sat reading cookery school reviews. One in particular had 169 reviews on Google. I kept seeing the words fun, funny and laugh. Over and over - suggesting people attend for fun. Equally I saw little comment around gaining cooking skills, very little. So, had I found my answer as to why people attend touristic cooking classes? Do I have to interview people at all? Why not just text-mine the reviews?
In fact, over the last day or two things have picked up, interviews nearly finished. But I'm still fascinated by text-mining, if only to capture a school's reviews for comparison against questionnaire responses.
My two questions are:
1. Do readers find text-mining a viable approach to inferentially discovering consumer motivations in the way I've said?
2. How acceptable do readers feel text-mining to be in academia, as opposed to marketing? I'm not sure I've even seen text-mining used, as referenced in an academic article. An exception is a Russian friend who is a big user, which might suggest that there are national differences on this?
How can we retrieve the location of tweets even when the users have turned location sharing off, and what features can be used to infer location? Is there any existing work on this?
I am an MS (CS) student searching for top research problems in sentiment analysis for my MS thesis. Kindly guide me. Thank you.
I have annotated my dataset using POS tags, chunking, and word case. If I include a dependency parser in my annotated dataset, will it help define more features for the classification of movie named entities? In short, will dependency relations improve the performance of the model on a movie reviews dataset? I need to identify movie names and person names in my corpus.
I want to classify news headline data. I am able to build the corpus, clean the data, and train an SVM (but only on a small data set). I am not splitting one data set into train and test; instead, I use a separate set for testing (but drawn from the same headline data).
I can train the model, but when testing with the test data I get an error:
Error: No. of variables in both are different.
Random forest gives the same error.
I have tried Naive Bayes (accuracy is very low, approximately 10%).
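That "number of variables differ" error usually comes from building separate document-term matrices for the training and test sets, so their columns (vocabularies) don't match. The fix is to fit the vectorizer once on the training data and only transform the test data. A Python sketch of the pattern (invented toy headlines; the same fit-then-transform idea applies to R's DocumentTermMatrix via a shared dictionary):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["markets rally on earnings", "team wins championship game",
               "stocks fall after report", "player scores winning goal"]
train_labels = ["business", "sports", "business", "sports"]
test_texts = ["shares rise as markets rally", "goal wins the game"]

# Fit the vectorizer ONCE on the training data, then only transform() the
# test data, so both matrices share the same columns (vocabulary).
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)
X_test = vec.transform(test_texts)        # same vocabulary, same width

clf = LinearSVC().fit(X_train, train_labels)
print(clf.predict(X_test))
```

Words in the test set that never appeared in training are simply dropped by transform(), which is the expected behaviour: the model has no weights for them anyway.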
I've read several times that on high-dimensional problems (image recognition, text mining, ...), deep learning gives significantly higher accuracy than "classical" methods (such as SVM, logistic regression, etc.). But what happens on problems of ordinary, medium dimension? Let's say the data set is on the order of 1,000 to 10,000 objects, and each object is characterized by 10 to 20 parameters. Are there articles that compare accuracy indicators (recall, precision, ...) between deep learning and other methods on some benchmarks?
Thanks beforehand for your answer. Regards, Sergey.
I would like to know the best one(s) to use, whether free or proprietary. Thanks much!
I am interested in text mining and use clustering techniques to cluster the words in a text. First, I want to select, say, the 200 most frequently used words. Then I have to build a distance matrix and a dendrogram for the selected words. Please suggest how I can do this in R.
Are there any survey papers on word embeddings in NLP that cover the whole history of word embedding, from simple topics like one-hot encoding to complex ones like the word2vec model?
Are there any R packages which can be used to mine text data in Malayalam?
Or is there any other FOSS package that can mine Malayalam text data?
I know that supervised methods are evaluated in terms of precision, recall, and F1 measure. What evaluation criteria are used for unsupervised methods? Can an unsupervised method be evaluated in terms of precision, recall, and F1?
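In short: without gold labels you use internal criteria (e.g. silhouette); with gold labels you can use external criteria (adjusted Rand index, and yes, precision/recall/F1 once clusters are matched to classes). A sketch on an invented two-blob toy set:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Two well-separated blobs stand in for document embeddings.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(3, 0.1, (10, 2))])
true_labels = [0] * 10 + [1] * 10

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Internal criterion (no labels needed): silhouette in [-1, 1].
print(silhouette_score(X, pred))
# External criterion (needs gold labels): adjusted Rand index; with gold
# labels you can also map each cluster to its majority class and then
# compute precision/recall/F1 as in the supervised case.
print(adjusted_rand_score(true_labels, pred))
```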
I'm looking for a free tool to recognize the terminology concepts in technical domains such as computer science and engineering.
Is there any available dictionary, gold standard, or tool to do that? And why is there not much research in this direction?
Nowadays there are plenty of core technologies for TC (Text Classification). Among all the ML approaches, which one would you suggest for training models for a new language and a vertical domain (like sports, politics, or economy)?
I am trying to use NLProt, but every time I try to run it I get the same error message:
sh: 1: svm_classify5: not found
Could not open svm_out_1_17802.txt!
The SVM is installed and all the paths are checked, but I still can't run NLProt.
I am working on a text segmentation project. I need to build lexical chains from plain text using WordNet or some other corpus.
There are decision tree algorithms like C4.5 for implementing lexical chains. Not being very skilled in Python, it's tough for me to work with decision trees. Is there any Python package or code available for finding lexical chains?
Hi, I have a query regarding text classification. I have a list of words with the following attributes: word, weight, class. The class can be positive or negative, and the weight is between -1 and 1. How can I train a classifier such as an SVM using this word list to classify unseen documents? An example in any tool is welcome.
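A weighted word list can be used directly as a classifier, with no training step: score a document by summing the weights of the listed words it contains, and let the sign decide the class. A minimal sketch with an invented word list (the per-document scores, or per-word hit counts, can also serve as input features for an SVM if you do want a trained model):

```python
# Invented example word list -- substitute your own (word, weight) pairs.
word_weights = {"excellent": 0.9, "good": 0.5, "poor": -0.6, "awful": -0.9}

def classify(document):
    """Sum the weights of listed words in the document; sign gives class."""
    tokens = document.lower().split()
    score = sum(word_weights.get(t, 0.0) for t in tokens)
    return "positive" if score >= 0 else "negative"

print(classify("an excellent and good movie"))  # → positive
print(classify("awful plot and poor acting"))   # → negative
```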
I wish to work in this area but am not finding enough resources. Please suggest some good journals or sites where I can study this.
I have done Twitter sentiment analysis using the VADER lexicon, but now I need to work with some other lexicon in order to compare the results.
1. My research requires creating an ontology, both domain-specific and in English (for the language). Is Protege the best option? What criteria should be kept in mind while creating ontologies?
2. A voluminous text file is given as input, the delimiter is Fullstop(".") i.e. at sentence level analysis has to be done, what would be the best way to keep track of the word order for a sentence?
3. Is there any repository for Unstructured text data (In English language) which can be used for testing? Thanks in advance.
As mentioned in the paper https://nlp.stanford.edu/pubs/glove.pdf, the authors learn two sets of word vectors (the word vectors W and the context vectors W~). Why are two separate sets of vectors required, and how are they learned?
I am a beginner in the field of text mining. I have implemented an algorithm for text pattern mining and collected a few samples from the Reuters RCV1 dataset. I know about precision, recall, and F-score, but I am confused about how to judge relevance. How do I measure how many relevant patterns the algorithm can retrieve?