Statistical Natural Language Processing - Science topic

Questions related to Statistical Natural Language Processing
  • asked a question related to Statistical Natural Language Processing
Question
17 answers
Because human thought is interconnected with language, what do you think about the integration of Natural Language Processing (NLP) with Deep Learning (DL)? I think it is the main way to build Artificial General Intelligence.
What approaches are used in the integration of NLP with DL? What are the trends in this area?
Relevant answer
Answer
Dear Amin Honarmandi Shandiz, thank you for your contribution. It is a very interesting paper. On the other hand, the integration of vision and language processing is only one part of the path toward implementing the understanding of meaning in AI.
  • asked a question related to Statistical Natural Language Processing
Question
1 answer
Hello,
I am interested in processing the ARC dataset (http://nlpprogress.com/english/question_answering.html) with the GPT2 double heads model neural network. The dataset (tab delimited) is structured as below:
```
Question <tab> Answer
Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? (A) worldwide disease (B) global mountain building (C) rise of mammals that preyed upon plants and animals (D) impact of an asteroid created dust that blocked the sunlight. <tab> D
```
I know that I am supposed to tokenize the dataset before passing it into the GPT2 double heads model. How should I tokenize this data? More specifically:
  1. Should I add a special token before each character that denotes the multiple choice options (A), (B), (C) and (D)?
  2. Should I add a special token before each string that denotes the contents of the multiple choice options?
  3. Am I supposed to add the tokens "<bos>" and "<eos>" at the beginning and the end of each question statement?
  4. If I pass this data into a GPT2 Double Heads Model (the GPT2 model with two heads) for processing multiple choice questions, what should I do with the part that denotes the actual answer to the multiple choice question?
So, for instance, to generate an input sequence for the GPT2 double heads model, should I break up the original question statement into 4 sequences, one for each multiple choice option, and apply the tokenization to each of the 4 sequences as below?
```
<bos> Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? <spec_token1> (A) <spec_token2> worldwide disease <eos>
<bos> Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? <spec_token1> (B) <spec_token2> global mountain building <eos>
<bos> Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? <spec_token1> (C) <spec_token2> rise of mammals that preyed upon plants and animals <eos>
<bos> Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? <spec_token1> (D) <spec_token2> impact of an asteroid created dust that blocked the sunlight. <eos>
```
Thank you,
PS: I found this site https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313 and it seems to address some of the questions I have, but it is not a complete answer.
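For illustration, here is a minimal sketch (untested, assumptions flagged in comments) of how such sequences could be built and tokenized with the HuggingFace transformers library that the linked post uses. The <spec_token1>/<spec_token2> names come from the question above; the extra <cls> token, which gives the multiple-choice head a fixed position to read from, is my own assumption, and the gold answer (D) becomes a class index for mc_labels.
```
# Sketch: GPT2DoubleHeadsModel inputs for one ARC question (not a tested recipe).
import torch
from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({
    "bos_token": "<bos>", "eos_token": "<eos>",
    "cls_token": "<cls>", "pad_token": "<pad>",
    "additional_special_tokens": ["<spec_token1>", "<spec_token2>"],
})

model = GPT2DoubleHeadsModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # account for the new tokens

question = ("Which of these do scientists offer as the most recent explanation "
            "as to why many plants and animals died out at the end of the "
            "Mesozoic era?")
options = ["worldwide disease", "global mountain building",
           "rise of mammals that preyed upon plants and animals",
           "impact of an asteroid created dust that blocked the sunlight."]
letters = ["(A)", "(B)", "(C)", "(D)"]

# One sequence per option, ending in <cls> for the multiple-choice head.
sequences = [f"<bos> {question} <spec_token1> {l} <spec_token2> {o} <eos> <cls>"
             for l, o in zip(letters, options)]

enc = tokenizer(sequences, padding=True, return_tensors="pt")
input_ids = enc["input_ids"].unsqueeze(0)  # (batch=1, n_choices, seq_len)
# Position of <cls> within each choice sequence:
mc_token_ids = (enc["input_ids"] == tokenizer.cls_token_id).long().argmax(-1).unsqueeze(0)
mc_labels = torch.tensor([3])              # gold answer is (D)

out = model(input_ids=input_ids, mc_token_ids=mc_token_ids, mc_labels=mc_labels)
print(out.mc_logits)                       # one score per choice
```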
  • asked a question related to Statistical Natural Language Processing
Question
3 answers
I am looking for a detailed description of the algorithm and method behind the sentiment analysis tool GPOMS (Google Profile of Mood States). If I want to develop such a tool myself, how would I start?
Relevant answer
Answer
You can find a great explanation in this book:
  • asked a question related to Statistical Natural Language Processing
Question
2 answers
Can BERT also do entity-level sentiment analysis? Is there an open-source tool based on BERT that does so?
Relevant answer
Answer
  • asked a question related to Statistical Natural Language Processing
Question
3 answers
I have trained word embeddings on a "clean" corpus with fastText, and I want to compare their quality against the pre-trained multilingual embeddings in BERT, which I understand to be trained on a "very noisy" corpus (Wikipedia).
Any Suggestions or Ideas on how to go about evaluating/comparing the performance would be appreciated.
Relevant answer
Answer
It is best to test on your own task. If you are doing text classification, I would recommend starting with an AUC evaluation; for named entity recognition, use the F1 score.
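If an intrinsic check is also wanted alongside the task-based evaluation suggested above, gensim can score embeddings against human word-similarity judgments; a minimal sketch (file names are placeholders, and for BERT one would first have to extract static per-word vectors, e.g. by averaging subword embeddings, which makes the comparison approximate):
```
# Intrinsic sketch: Spearman correlation between embedding similarity and
# human judgments on a WordSim-353-style tab-separated file.
from gensim.models.fasttext import load_facebook_vectors

vectors = load_facebook_vectors("my_clean_corpus.bin")  # placeholder path

pearson, spearman, oov_ratio = vectors.evaluate_word_pairs("wordsim353.tsv")
print(f"Spearman: {spearman[0]:.3f} (p={spearman[1]:.3g}), OOV: {oov_ratio:.1f}%")
```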
  • asked a question related to Statistical Natural Language Processing
Question
4 answers
Most of the proposed algorithms concentrate on neighboring concepts (events), like "enter restaurant" --> "wait for waiter", but I have trouble finding papers on generating/retrieving longer scripts (I am not talking about the narrative cloze task) that are evaluated for commonness.
Relevant answer
Answer
Arturo Geigel, thank you for the suggestion. I've been working on computational creativity for some time, but the problem is that, e.g., poetry (less so humor) allows too much freedom, whereas commonsense behavior patterns (even if unlimited in quantity) are stricter and shared by most people, which makes them hard to generate, especially because they are not expressed in one chunk of text; they must be "glued together" from pieces scattered across various texts. My guess is that before creating something original, you need to know what is common (and boring, in a sense).
  • asked a question related to Statistical Natural Language Processing
Question
6 answers
Dear All,
I have been analyzing data from surveys collected on Amazon M-Turk for the last year, and a lot of the time it is obvious (and understandable) that people do a pretty awful job of responding. I can completely understand that people will often be tired, drunk or stoned, and will be filling in surveys to make ends meet, but I need a widely accepted way of dealing with these responses so they don't add noise to the results.
I come from a neuroscience/psychophysics background, where I had plenty of freedom in cleaning data (as long as I did it transparently), but now in Consumer Research & Marketing a justified but somewhat arbitrary cleaning of the data is less accepted, both in the reports I produce and in the journals I am targeting.
I have an open question at the end of the survey, for ethical reasons, where I ask people what they think the purpose of the study was. These are some of the responses I get (real responses):
- NOTHING
- i ave impressed
- no
- NOTHING FOR LIKE THAT UNCLEAR. IT'S ALMOST FOR SATISFIED.
Clearly one cannot expect anything from a respondent that answers in such a way, and, in fact, when I eliminate such respondents the results make much more sense. I have already set my sample to US residents only, and stated I want English speakers. But linguistically impaired or non-English speakers seem to wriggle their way in.
What do you advise me to do? What is acceptable in science and business, in terms of dealing with random, bad, non-sensical responses?
Some people tell me that they eliminate up to 50% of data from M-turk because it is crappy, and that is normal to them. Other people say that is unacceptable. The people who eliminate up to 50% of data seem to not report it. I would like to have a reasonable procedure that most reasonable people would see as acceptable, and report it.
I am thinking about investing time creating a little program that processes English language and that detects text that cannot be considered as functional, grammatically-sound English statements. Is that something someone has tried?
Lastly, I have heard about an elusive statistical procedure that detects random responses when rating items on a 5- or 7-point scale. I cannot find anything concrete on this, which makes me think it's not widely accepted, well-known or generalizable.
Any tips or thoughts on the matter will be well appreciated.
Michael
Relevant answer
Answer
If you are working in a KAP (Knowledge-Attitude-Perception) framework, it makes sense to use your Knowledge questions to rank response adequacy. From there on, whether you proceed with an outlier analysis followed by some kind of normalization, or go with a weighting scheme, is up to you.
You can detect (or avoid) random answers by incorporating 'validation questions' (rephrasings of questions asked previously). If respondents answer both the original question and its validation counterpart consistently, you're good; if not, you have grounds to suspect they are not doing their best.
Optional questions mean you already have a way to handle imbalanced data, since not all respondents will opt to answer. It is similar with open questions that have no word limit.
Using linguistic criteria doesn't seem appropriate, unless the survey itself is linguistics-oriented.
You can always report versions of your analysis on both 'as-is' and 'clean' data.
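On the asker's idea of programmatically flagging non-functional English, here is a rough heuristic sketch (my own illustration, not a validated method): score each open-ended response by the fraction of tokens found in an English wordlist. Note it catches non-words but not grammatically incoherent text made of real words; for the latter, a language-model perplexity score would be needed.
```
# Heuristic sketch: flag responses with a low ratio of dictionary words.
import re
import nltk

nltk.download("words", quiet=True)
ENGLISH = {w.lower() for w in nltk.corpus.words.words()}

def english_token_ratio(response: str) -> float:
    tokens = re.findall(r"[a-z']+", response.lower())
    if not tokens:
        return 0.0
    return sum(t in ENGLISH for t in tokens) / len(tokens)

for r in ["jf kdslja weoi", "I think the study was about brand memory."]:
    print(f"{english_token_ratio(r):.2f}  {r}")
# Any cutoff (say, < 0.8) is arbitrary: flag responses for manual review
# and report the procedure rather than silently excluding respondents.
```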
  • asked a question related to Statistical Natural Language Processing
Question
4 answers
Dear friends,
We are looking for speech recognition software that will allow us to automatically code the voices of multiple participants talking in alternating turns.
Basically, we expect the software to
1) differentiate voice of participants from each other and assign a different code to each participant,
2) provide a timeline output that will allow us to quantify the duration and frequency of talk for each participant.
A hypothetical coding output would be like this:
ParticipantA:1
ParticipantB:2
ParticipantC:3
No talk:0
Timeline------------------------->>
2222220111111000000022222233333330000011111000000222
Relevant answer
Answer
You may use Raspberry Pi and Zigbee... [1]
or MacSpeech® Dictate speech recognition software [2]
or a mobile application [3]
Reference:
[1] Younis, S. A., Ijaz, U., Randhawa, I. A., & Ijaz, A. (2018). Speech Recognition Based Home Automation System using Raspberry Pi and Zigbee. NFC IEFR Journal of Engineering and Scientific Research, 5.
[2] Hon-Anderson, E. (2018). U.S. Patent No. 9,865,263. Washington, DC: U.S. Patent and Trademark Office.
[3] Kovacs, M. D., Cho, M. Y., Burchett, P., & Trambert, M. (2018). Benefits of Integrated RIS/PACS/Reporting Due To Automatic Population of Templated Reports. Current Problems in Diagnostic Radiology.
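The task described (who spoke when, and for how long) is usually called speaker diarization. As a further pointer (my suggestion, not from the references above), the open-source pyannote.audio pipeline produces exactly this kind of timeline; a minimal sketch, with a placeholder file name (recent versions may also require a Hugging Face access token):
```
# Diarization sketch: one line per speaker turn, from which duration and
# frequency of talk per participant can be computed.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("meeting.wav")  # placeholder recording

for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:7.1f}s {segment.end:7.1f}s  {speaker}")
```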
  • asked a question related to Statistical Natural Language Processing
Question
9 answers
If I had to identify the most generic steps of text analytics, what are the most commonly used steps in any text analysis model?
Any help and your expert guidance/ suggestions are welcome.
Thanks in advance
Relevant answer
Answer
Adding to the above: if your approach involves NLP at the pre-processing step, there are several sub-tasks in NLP which are generally represented as a sequential chain/pipeline performed over your input items. These tasks range from low-level operations (tokenization, stopword removal, statistical analysis like TF-IDF) to higher-level ones (WSD, coreference resolution, NER...). A quick search for "NLP chain" will give you examples and frameworks that suit your needs.
From this intermediate data representation you can build the analytics tasks described in previous answers (data modeling, clustering/classification, visualization...).
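To make the low-level end of such a pipeline concrete, a minimal scikit-learn sketch (the toy documents are placeholders):
```
# Tokenization, stopword removal, and TF-IDF weighting in a few lines.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Text analytics turns raw text into structured features.",
        "A typical pipeline tokenizes, removes stopwords, and weights terms."]

vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
X = vectorizer.fit_transform(docs)  # sparse document-term matrix
print(X.shape)
print(vectorizer.get_feature_names_out()[:5])
```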
  • asked a question related to Statistical Natural Language Processing
Question
3 answers
I want to know the best Arabic named-entity recognition tools available and how to use them.
Thanks in advance
Relevant answer
Answer
  • asked a question related to Statistical Natural Language Processing
Question
3 answers
Extracting inflectional paradigms from a raw corpus using machine learning or other techniques.
Relevant answer
Answer
It's an ordered set of regular-expression rules: exceptions come first, general rules last.
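A toy illustration of that ordering (my own example, using English plural analysis; real paradigm extraction would induce such rules from the corpus rather than hand-code them):
```
# Ordered rules: exceptions are tried before general rules; first match wins.
import re

RULES = [
    (re.compile(r"^(child)ren$"), r"\1"),       # exception
    (re.compile(r"^(m)ice$"), r"\1ouse"),       # exception
    (re.compile(r"(.+[^aeiou])ies$"), r"\1y"),  # flies -> fly
    (re.compile(r"(.+)es$"), r"\1"),            # boxes -> box (approximate)
    (re.compile(r"(.+)s$"), r"\1"),             # most general rule, last
]

def singularize(word: str) -> str:
    for pattern, repl in RULES:
        if pattern.match(word):
            return pattern.sub(repl, word)
    return word

print([singularize(w) for w in ["children", "mice", "flies", "boxes", "cats"]])
# ['child', 'mouse', 'fly', 'box', 'cat']
```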
  • asked a question related to Statistical Natural Language Processing
Question
15 answers
I am looking for a dataset which contains a list of phrases and the emotion associated with each. For example, for x="what the hell just happened", y='surprise',
and x="no one loves me", y='sad', etc.
It's kind of urgent, please.
Thank you.
Relevant answer
Answer
  • asked a question related to Statistical Natural Language Processing
Question
5 answers
The features look like the attached image file. How can I convert this text file into ARFF format for use with Weka?
Are there any tools that do that?
Is there any technique?
What are the steps to separate each feature into one column?
Relevant answer
Answer
Please consult someone who is an expert in documentation work, and use LaTeX for that.
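Since the conversion itself is not addressed above, here is a minimal sketch of writing an ARFF file by hand in Python. It assumes comma-separated numeric features with the class label in the last column; the parsing would have to be adapted to the actual layout of the attached file.
```
# Sketch: convert a delimited feature file to Weka's ARFF format.
def text_to_arff(in_path: str, out_path: str, relation: str = "features"):
    with open(in_path) as f:
        rows = [line.strip().split(",") for line in f if line.strip()]
    n_features = len(rows[0]) - 1
    classes = sorted({r[-1] for r in rows})
    with open(out_path, "w") as f:
        f.write(f"@RELATION {relation}\n\n")
        for i in range(n_features):
            f.write(f"@ATTRIBUTE f{i} NUMERIC\n")
        f.write(f"@ATTRIBUTE class {{{','.join(classes)}}}\n\n@DATA\n")
        for r in rows:
            f.write(",".join(r) + "\n")

text_to_arff("features.txt", "features.arff")  # placeholder file names
```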
  • asked a question related to Statistical Natural Language Processing
Question
18 answers
What is the best way right now to measure the text similarity between two documents based on the word2vec word embeddings?
We used word2vec to create word embeddings (vector representations for words).
Now we want to use these word embeddings to measure the text similarity between two documents.
Which technique is the best right now for calculating text similarity using word embeddings?
Thanks.
Relevant answer
Answer
Here is an interesting work done by Matt Kusner:
If you want to use the cosine distance, averaging word vectors (trained using word2vec) for document embedding may help you: 
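A minimal sketch of the averaging approach with gensim (the vector file name is a placeholder). gensim also ships the Word Mover's Distance from Kusner's work as wmdistance, which may require an optional dependency to be installed:
```
# Document similarity via averaged word2vec vectors, plus WMD for comparison.
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # placeholder

def doc_vector(tokens):
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = "the cat sat on the mat".split()
d2 = "a kitten rested on the rug".split()
print(cosine(doc_vector(d1), doc_vector(d2)))  # higher = more similar
print(wv.wmdistance(d1, d2))                   # lower = more similar
```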
  • asked a question related to Statistical Natural Language Processing
Question
7 answers
I am working on a project where I need to calculate the perplexity or cross-entropy of some text data. I have been using MITLM, but it does not seem to be very well documented. Can anyone suggest an alternative to MITLM? Thanks!
Relevant answer
Answer
SRILM is quite handy and well-documented. The FAQ explains how to compute perplexity with the ngram tool: http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html
Another really short and handy explanation of perplexity with SRILM is given here: http://cmusphinx.sourceforge.net/wiki/tutoriallm
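If staying in Python is preferable, NLTK's language-model package can compute perplexity directly; a toy sketch with a bigram Laplace-smoothed model (the two-sentence corpus is a placeholder):
```
# Train a tiny bigram LM and score test bigrams by perplexity.
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

train_sents = [["the", "cat", "sat"], ["the", "dog", "ran"]]
train_data, vocab = padded_everygram_pipeline(2, train_sents)

lm = Laplace(2)
lm.fit(train_data, vocab)

test_bigrams = [("the", "cat"), ("cat", "sat")]
print(lm.perplexity(test_bigrams))
```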
  • asked a question related to Statistical Natural Language Processing
Question
3 answers
I have a small corpus from the software-projects domain and need vector representations of words. Word2Vec gives good results but requires a big corpus. I thought about using pre-trained models such as the Google News word2vec model, but it is a general model, not adapted to specific domains. So I need a tool or technique that allows me to derive a model adapted to my domain. Is there any solution?
Relevant answer
As Kwan-yuet says, you can use bag-of-words. Moreover, if you want to add some context without having to use Word2Vec, you can create vectors of n-grams (unigrams, bigrams, skip-bigrams, trigrams, etc.). In that case, you can know in what context a certain word was used in the corpus.
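If word2vec itself is still wanted, gensim can also train directly on a small domain corpus; a minimal sketch (the toy corpus and hyperparameters are placeholders; with little data, skip-gram, a low min_count and extra epochs often help):
```
# Train word2vec on a small, domain-specific corpus.
from gensim.models import Word2Vec

corpus = [["fix", "null", "pointer", "exception"],
          ["refactor", "database", "connection", "pool"]]  # tokenized docs

model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 min_count=1, epochs=50, sg=1)  # sg=1: skip-gram
print(model.wv.most_similar("database", topn=3))
```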
  • asked a question related to Statistical Natural Language Processing
Question
7 answers
For many years, statistical language models have been a major focus of research in NLP tasks. But apart from these language models, what other types of models were/are used for NLP tasks?
Relevant answer
Answer
Distributional semantic models and computational distributional semantic models also fall under the statistical umbrella, together with statistical semantic models: the statistical models use machine learning from artificial intelligence, while the distributional models use computational methods that have their own parameters and rules.
  • asked a question related to Statistical Natural Language Processing
Question
4 answers
I need some normative data on the semantic relations between categories. In particular, I need coordinate values for these categories in a semantic space. The best I can find on this topic so far is Cree & McRae (2003). They offer distance calculations between categories. However, I need the coordinate locations of each of the categories in the space they are projected onto.
For instance, the semantic distance between the categories herbivore and insect might be represented as 5. However, I would need to know that the location of insect in the projected space is 3 and that of herbivore is 12.
Thanks!
Brandon
Relevant answer
Answer
Peter Gärdenfors has done research on that topic. Search for 'geometry of thought' and 'conceptual spaces'.
I'm using a semantic net for associative search in my databases. Using the shortest path for semantic distance is not the best idea, since a shortest path may lead through a vertex which happens to have hundreds of connections. Instead I use random walks and take the logarithm of the visit rank as a measure of semantic distance.
Networks come along with their own space. Coordinates of a vertex v in the net could be defined as the vector of distances to some fixed chosen base vertices v0, v1, v2, ...
It is also possible to embed the network topology into Euclidean space, 2D (cortex) or 3D. This could be done with Kohonen learning, an autoencoder, or an evolutionary algorithm. It should be mentioned that the brain has a grid architecture where each neuron has its address as 3D coordinates; this has been found by means of diffusion tensor imaging. But in reality a symbol's representation might be distributed or encoded in waves.
Literature:
Regards,
Joachim
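A toy sketch of the random-walk idea above (my reading of the description, not Joachim's actual implementation), using networkx with a stand-in graph:
```
# Estimate semantic distance from a source vertex as -log of the visit rate
# accumulated over many short random walks.
import math
import random
import networkx as nx

G = nx.karate_club_graph()  # stand-in for a semantic net

def walk_distances(G, source, n_walks=10000, walk_len=5):
    visits = {v: 0 for v in G}
    for _ in range(n_walks):
        node = source
        for _ in range(walk_len):
            node = random.choice(list(G.neighbors(node)))
            visits[node] += 1
    total = sum(visits.values())
    return {v: (-math.log(c / total) if c else float("inf"))
            for v, c in visits.items()}

d = walk_distances(G, source=0)
print(sorted(d.items(), key=lambda kv: kv[1])[:5])  # semantically nearest vertices
```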
  • asked a question related to Statistical Natural Language Processing
Question
6 answers
I am trying to develop software that suggests suitable attributes for entity names depending on the entity type.
For example, if I have entities such as doctor, nurse, employee, customer, patient, lecturer, donor, user, developer, designer, driver, passenger and technician, they will all have attributes such as name, sex, date of birth, email address, home address and telephone number, because all of them are people.
As a second example, words such as university, college, hospital, hotel and supermarket can share attributes such as name, address and telephone number, because all of them could be organizations.
Are there any Natural Language Processing tools or software that could help me achieve my goal? I need to identify the entity type as person or organization, then attach suitable attributes according to the entity type.
I have looked at Named Entity Recognition (NER) tools such as the Stanford Named Entity Recognizer, which can extract entities such as Person, Location, Organization, Money, Time, Date and Percent, but it was not really useful.
I could do it by building my own gazetteer, but I would prefer not to go down that route unless I fail to do it automatically.
Any helps, suggestions and ideas will be appreciated.  
Relevant answer
Answer
Mussa,
This might not be a very helpful answer, but from my understanding NLP techniques often rely on context to understand what is being discussed. So a single word like "doctor" is very difficult to understand unless it appears in some context like "a doctor treats sick people". From that sentence, an NLP system might recognize that doctor is a noun and might infer something about it relating to people. Without this context, it will be tough to discern the categorical differences between single words.
It might be less complicated (although more time-consuming) to create a predefined list of terms that you would like to classify and then simply match words against those lists in order to create your associated list of features for a given entity.
Hope that helps.
Sean
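One middle ground between context-dependent NER and a hand-built gazetteer (my suggestion, not part of the answer above) is to look bare nouns up in WordNet and walk their hypernym hierarchy; a minimal sketch with NLTK:
```
# Classify a bare noun as person-like or organization-like via hypernyms.
from nltk.corpus import wordnet as wn

PERSON = wn.synset("person.n.01")
ORG = wn.synset("organization.n.01")

def entity_type(word: str) -> str:
    for synset in wn.synsets(word, pos=wn.NOUN):
        hypernyms = set(synset.closure(lambda s: s.hypernyms()))
        if PERSON in hypernyms:
            return "person"
        if ORG in hypernyms:
            return "organization"
    return "unknown"

for w in ["doctor", "nurse", "university", "hospital", "driver"]:
    print(w, "->", entity_type(w))
```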
  • asked a question related to Statistical Natural Language Processing
Question
3 answers
I am currently working on a natural language processing project in Python. I wonder whether I can use fuzzy principles when taking a probabilistic decision in NLP.
Relevant answer
Answer
No
  • asked a question related to Statistical Natural Language Processing
Question
4 answers
Hi there, 
I am aiming to study multi-dimensional sustainability and well-being (SaW) as a unified subject matter. Instead of conducting a traditional literature review, I am interested in applying a more sophisticated method (e.g. LDA) to scientifically establish connections within SaW. Please share your thoughts on how practical it is to address this with topic modeling using LDA. If somebody is interested in collaborating on one of my papers, please let me know.
Relevant answer
Answer
I would probably use a different set of documents as a validation set. But I am just guessing. I agree that generating a brief answer to the question "are corpora A and B related to the same topic" is not straightforward. In some sense, you could see it as an "author verification" problem, but, usually, in that case style matters more than content.
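For concreteness, a minimal LDA sketch with gensim over a toy corpus (in practice, tokenized SaW abstracts would replace docs, and the number of topics would need tuning and validation):
```
# Fit a tiny LDA model and print its topics.
from gensim import corpora
from gensim.models import LdaModel

docs = [["sustainability", "energy", "policy"],
        ["wellbeing", "health", "happiness"],
        ["sustainable", "wellbeing", "development"]]  # placeholder tokens

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=20, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```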
  • asked a question related to Statistical Natural Language Processing
Question
5 answers
I am interested in solving different dynamic models, but I have not yet selected which platform I will use. I thought of Python because it is free software, but suggestions are welcome.
Moreover, is it possible to model a bilevel problem in Python?
Relevant answer
Answer
You may refer to nltk.org, which also has a supporting book, Natural Language Processing with Python (O'Reilly). The site is very useful for finding packages for basic NLP tasks.
  • asked a question related to Statistical Natural Language Processing
Question
6 answers
How can I find a survey or literature review of the latest research on methods for representing words as continuous vectors in a continuous space?
This is from the perspective of language modeling, as I want to build a statistical language model in continuous space.
Relevant answer
Answer
Google's word2vec is an easy-to-use toolkit to represent words as vectors in a continuous space (https://code.google.com/p/word2vec/).
An earlier reference is Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL-HLT, 2013.
Actually, this idea came from language modeling, and you can check Mikolov's publications about RNNLM.
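A tiny sketch of the 'linguistic regularities' property from that paper, using gensim (the pre-trained Google News vectors must be downloaded separately and need several GB of RAM):
```
# The classic king - man + woman analogy on pre-trained word2vec vectors.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                       binary=True)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# expected: [('queen', ...)]
```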
  • asked a question related to Statistical Natural Language Processing
Question
3 answers
Dear all,
I had an idea for a data structure that provides string-to-something map functionality but is not a usual (hash) map. After implementing it in Python, I obtained some first evidence that it is faster than a map; the sizes of the respective structures still have to be compared.
But I do not work on data structures and/or algorithms, and I do not even know whether something like this is already described in the scientific literature. Where can I find more information about this topic, or someone who works on this? Good keywords or names of data structures/algorithms that provide similar functionality would be enough.
Sincerely,
Daniel
Relevant answer
Answer
  • asked a question related to Statistical Natural Language Processing
Question
5 answers
I would like to know whether TF-IDF is domain independent. In more detail, I have a dataset containing documents about "disease outbreaks" and "smart technology", and then compute:
1. TF-IDF for all docs in the whole dataset, saving the value of each term.
2. TF-IDF only for the "smart technology" docs, saving the value of each term.
My question is: if I apply a clustering or classification algorithm to the "smart technology" docs, does it affect the results whether I use the TF-IDF values from 1 or from 2?
Thanks.
Relevant answer
Answer
There will be a strong effect. You should always use documents from your target domain, i.e. if you cluster/classify "smart technology" documents, you should use TF-IDF computed on "smart technology" documents.
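A quick sketch showing why the fitting corpus matters: the IDF of a term depends on which documents the vectorizer is fitted on (toy documents, scikit-learn):
```
# The same term gets a different IDF under options 1 and 2 from the question.
from sklearn.feature_extraction.text import TfidfVectorizer

smart = ["smart sensors stream data", "smart grid data analytics"]
disease = ["outbreak spreads in region", "disease data reported daily"]

for name, corpus in [("smart only", smart), ("whole dataset", smart + disease)]:
    v = TfidfVectorizer().fit(corpus)
    idf = dict(zip(v.get_feature_names_out(), v.idf_))
    print(name, "idf('data') =", round(idf["data"], 3))
```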
  • asked a question related to Statistical Natural Language Processing
Question
3 answers
The project is as follows:
To build a processing system that checks spelling and diction and identifies parts of speech. The system should compute a fog index for the manuscript and provide summary information such as average sentence length, number and percentage of compound sentences, use of commas, number of paragraphs, etc.
Once done, we'll attempt to simplify the language for young readers when the readability caters to a higher level of readers.
Constraints: 
The project is hoped to be done by a group of two final year CS engineering students.
We know Java and can build our knowledge on it.
We can learn Prolog during December and the project has to be completed by April-May.
Please help us know if it's possible and wise enough to venture into it, us being two amateurs making an effort to know more each day.
Relevant answer
Answer
Dear Tryphena,
It depends on the goals of your project. Since you mentioned that you are final-year CS engineering students, I guess your goal is to master the techniques and methods introduced to you during your studies, so you can look at the documentation of the Stanford NLP group.
However, if you want to delve into research work, you should not start from scratch. The Stanford NLP packages offer most of the features you have mentioned, and you can call them from Java.
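Since the fog index is the core metric mentioned, here is a rough Python sketch of the Gunning fog index, 0.4 * (average sentence length + percentage of words with three or more syllables). The syllable counter is a crude vowel-group heuristic, good enough for a first prototype:
```
# Compute a rough Gunning fog index for a text.
import re

def syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

print(fog_index("The system computes readability. It reports a fog index."))
```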
  • asked a question related to Statistical Natural Language Processing
Question
1 answer
I need to generate Penn Treebank-style output from a POS-tagged corpus.
Please let me know if there is any tool for this which is language independent.
Relevant answer
Answer
TreeTagger seems to be language independent...
I tried the Stanford parser and the Brill tagger for English only... but each has its own limitations.
  • asked a question related to Statistical Natural Language Processing
Question
3 answers
I'm looking at the problem of identifying users asking a question we already have an answer for, but phrased in a way we are not accounting for. For example, I might have a response for a question like "what is the population of brazil?", but the user might ask "how many people live in brazil?" or "what's the number of inhabitants in brazil?". I would like to be able to classify those questions as equivalent.
Any suggestions on some papers, research material I should look into, or maybe hints on what things I should try?
Relevant answer
Answer
Take a look at:
Text Relatedness Based on a Word Thesaurus, by George Tsatsaronis, Iraklis Varlamis, and Michalis Vazirgiannis.
You might find it useful.
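As a cheap baseline before thesaurus- or embedding-based methods, TF-IDF cosine similarity against the known-question inventory is easy to try (my illustration; note how it struggles on exactly the paraphrase case the asker describes):
```
# Score an incoming question against known questions by TF-IDF cosine.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

known = ["what is the population of brazil?"]
query = ["how many people live in brazil?"]

v = TfidfVectorizer(stop_words="english").fit(known + query)
sim = cosine_similarity(v.transform(query), v.transform(known))
print(sim)  # modest score: only 'brazil' overlaps lexically, which is why
            # semantic relatedness methods like the paper above are needed
```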
  • asked a question related to Statistical Natural Language Processing
Question
1 answer
I am looking for a tool which is source-language independent and which can generate a PCFG from a plain/annotated corpus.
Relevant answer
Answer