Statistical Natural Language Processing - Science topic
Explore the latest questions and answers in Statistical Natural Language Processing, and find Statistical Natural Language Processing experts.
Questions related to Statistical Natural Language Processing
Because human thought is interconnected with language, what do you think about the integration of Natural Language Processing (NLP) with Deep Learning (DL)? I think it is the main way to build Artificial General Intelligence.
What approaches are used to integrate NLP with DL? What are the trends in this area?
Hello,
I am interested in processing the ARC dataset (http://nlpprogress.com/english/question_answering.html) with the GPT2 double heads model. The dataset (tab-delimited) is structured as below:
```
Question <tab> Answer
Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? (A) worldwide disease (B) global mountain building (C) rise of mammals that preyed upon plants and animals (D) impact of an asteroid created dust that blocked the sunlight. <tab> D
```
I know that I am supposed to tokenize the dataset before passing it into the GPT2 double heads model. How should I tokenize this data? More specifically:
- Should I add a special token before each character that denotes a multiple-choice option, i.e. (A), (B), (C) and (D)?
- Should I add a special token before each string that contains the content of a multiple-choice option?
- Am I supposed to add the tokens "<bos>" and "<eos>" at the beginning and end of each question statement?
- If I pass this data into a GPT2 Double Heads Model (the GPT2 model with two heads) for processing multiple-choice questions, what should I do with the part that denotes the actual answer to the question?
So, for instance, to generate an input sequence for the GPT2 double heads model, should I break the original question statement up into four sequences, one for each multiple-choice option, and apply the tokenization to each of the four sequences as below?
```
<bos> Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? <spec_token1> (A) <spec_token2> worldwide disease <eos>
<bos> Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? <spec_token1> (B) <spec_token2> global mountain building <eos>
<bos> Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? <spec_token1> (C) <spec_token2> rise of mammals that preyed upon plants and animals <eos>
<bos> Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? <spec_token1> (D) <spec_token2> impact of an asteroid created dust that blocked the sunlight. <eos>
```
Thank you,
PS: I found this site https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313 and it seems to address some of my questions, but it does not fully answer them.
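For what it's worth, here is a minimal sketch of the per-option sequence construction described above. The special tokens (`<bos>`, `<eos>`, `<spec_token1>`, `<spec_token2>`) are placeholders that would have to be registered with the tokenizer before training; with HuggingFace's GPT2DoubleHeadsModel, the gold answer is usually not placed in the text at all, but passed separately as `mc_labels` (the index of the correct option), with `mc_token_ids` marking the classification token in each candidate sequence.

```python
import re

def build_choice_sequences(question_with_options):
    """Split an ARC-style question into one sequence per answer option.

    The special tokens below are placeholders; they must be added to the
    tokenizer's vocabulary (e.g. via add_special_tokens) before use.
    """
    # Separate the question stem from the "(A) ... (B) ..." options.
    parts = re.split(r"\((A|B|C|D)\)", question_with_options)
    stem = parts[0].strip()
    # parts alternates: [stem, 'A', text_a, 'B', text_b, ...]
    sequences = []
    for i in range(1, len(parts) - 1, 2):
        label, text = parts[i], parts[i + 1].strip()
        sequences.append(
            f"<bos> {stem} <spec_token1> ({label}) <spec_token2> {text} <eos>"
        )
    return sequences

q = ("Which gas do plants absorb? "
     "(A) oxygen (B) carbon dioxide (C) helium (D) nitrogen")
for s in build_choice_sequences(q):
    print(s)
```

This is only a sketch of the text layout; the example question is made up, and the actual tokenization into ids is still done by the GPT2 tokenizer afterwards.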
I am looking for an elaborate description of the algorithm and method behind the sentiment-analysis tool GPOMS (Google Profile of Mood States). If I wanted to develop such a tool myself, how would I start?
Can BERT also do entity-level sentiment analysis? Is there an open-source tool based on BERT that does so?
I have trained word embeddings on a "clean" corpus with fastText, and I want to compare their quality against the pre-trained multilingual BERT embeddings, which I understand were trained on a much noisier corpus (Wikipedia).
Any suggestions or ideas on how to evaluate and compare their performance would be appreciated.
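One hedged suggestion: a common intrinsic evaluation is to score both embedding sets against a human-rated word-similarity benchmark (e.g. WordSim-353 or SimLex-999) and compare Spearman rank correlations. A self-contained sketch with toy vectors standing in for the real fastText / BERT vectors:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    """Spearman rank correlation (no tie correction)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def benchmark_score(embeddings, word_pairs_with_gold):
    """Correlate model cosine similarities with human similarity ratings."""
    sims, gold = [], []
    for w1, w2, rating in word_pairs_with_gold:
        if w1 in embeddings and w2 in embeddings:
            sims.append(cosine(embeddings[w1], embeddings[w2]))
            gold.append(rating)
    return spearman(sims, gold)

# Toy embeddings and ratings; real ones would come from your two models
# and a published benchmark file.
emb = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.0, 1.0]}
pairs = [("cat", "dog", 9.0), ("cat", "car", 1.5), ("dog", "car", 2.0)]
print(benchmark_score(emb, pairs))
```

Running the same benchmark over both embedding tables gives two directly comparable correlation scores; an extrinsic comparison (same downstream classifier, swapped embeddings) is the other standard option.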
Most of the proposed algorithms concentrate on neighboring concepts (events), like "enter restaurant" --> "wait for waiter", but I have trouble finding papers on generating or retrieving longer scripts (I am not talking about the narrative cloze task) that are evaluated for commonness.
Dear All,
I have been analyzing data from surveys collected on Amazon M-Turk for the last year, and much of the time it is obvious (and understandable) that people do a pretty awful job of responding. I can completely understand that respondents will often be tired, drunk or stoned, and filling in surveys to make ends meet, but I need to find a widely accepted way of dealing with these responses so they don't add noise to the results.
I come from a neuroscience/psychophysics background, where I had a lot of freedom in cleaning data (as long as I did it transparently), but now, in Consumer Research & Marketing, a justified but somewhat arbitrary cleaning of the data is less accepted, both in the reports I produce and in the journals I am targeting.
I have an open question at the end of the survey, for ethical reasons, where I ask people what they think the purpose of the study was. These are some of the responses I get (real responses):
- NOTHING
- i ave impressed
- no
- NOTHING FOR LIKE THAT UNCLEAR. IT'S ALMOST FOR SATISFIED.
Clearly one cannot expect anything from a respondent who answers in such a way, and, in fact, when I eliminate such respondents the results make much more sense. I have already restricted my sample to US residents and stated that I want English speakers, but linguistically impaired or non-English speakers seem to wriggle their way in.
What do you advise me to do? What is acceptable in science and business for dealing with random, bad, or nonsensical responses?
Some people tell me that they eliminate up to 50% of M-Turk data because it is of poor quality, and that this is normal to them. Other people say that is unacceptable. The people who eliminate up to 50% of the data also seem not to report it. I would like a reasonable procedure that most reasonable people would see as acceptable, and to report it.
I am thinking about investing time in creating a little program that processes English and detects text that cannot be considered functional, grammatically sound English. Has anyone tried something like that?
Lastly, I have heard about an elusive statistical procedure that detects random responses when items are rated on a 5- or 7-point scale. I cannot find anything concrete on it, which makes me think it is not widely accepted, well known, or generalizable.
Any tips or thoughts on the matter will be well appreciated.
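On the "elusive statistical procedure": two widely cited, reportable screens for careless responding on rating scales are long-string analysis (the longest run of identical consecutive answers) and intra-individual response variability (the standard deviation of one respondent's ratings). A minimal sketch; the cut-off thresholds are a judgment call and should themselves be reported:

```python
from statistics import pstdev

def longest_run(responses):
    """Length of the longest run of identical consecutive answers
    (long-string analysis for straight-lining)."""
    best = cur = 1
    for prev, nxt in zip(responses, responses[1:]):
        cur = cur + 1 if nxt == prev else 1
        best = max(best, cur)
    return best

def irv(responses):
    """Intra-individual response variability: the standard deviation of
    one person's ratings; near-zero suggests non-differentiated answering."""
    return pstdev(responses)

careful = [4, 2, 5, 1, 3, 4, 2, 5]
straightliner = [3, 3, 3, 3, 3, 3, 3, 3]
print(longest_run(careful), longest_run(straightliner))  # 1 8
print(irv(straightliner))                                # 0.0
```

Because both indices are computed per respondent and flagged against a pre-declared threshold, they give exactly the kind of transparent, defensible exclusion rule reviewers tend to accept.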
Michael
Dear friends,
We are looking for a speech recognition software that will allow us to automatically code the voice of multiple participants talking in alternating turns.
Basically, we expect the software to
1) differentiate the participants' voices from one another and assign a different code to each participant,
2) provide a timeline output that will allow us to quantify the duration and frequency of talk for each participant.
A hypothetical coding output would be like this:
ParticipantA:1
ParticipantB:2
ParticipantC:3
No talk:0
Timeline------------------------->>
2222220111111000000022222233333330000011111000000222
If I had to list the most generic steps of text analytics, what are the most commonly used steps in any text-analysis pipeline?
Any help and expert guidance/suggestions are welcome.
Thanks in advance
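As a hedged illustration, a typical generic pipeline is: collect → clean/normalize → tokenize → remove stopwords → (stem/lemmatize) → vectorize → model → evaluate. The first few steps might look like this (the stopword list here is a tiny illustrative sample):

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real pipelines use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}

def preprocess(text):
    """Typical first steps: lowercase, tokenize, remove stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def vectorize(tokens):
    """Bag-of-words term frequencies, ready for a model."""
    return Counter(tokens)

doc = "The quick brown fox is jumping over the lazy dog."
tokens = preprocess(doc)
print(tokens)
print(vectorize(tokens))
```

From the bag-of-words stage onward, the steps (TF-IDF weighting, then clustering or classification) are mostly model-specific rather than generic.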
I want to know about the best available Arabic named-entity recognition tools and how to use them.
Thanks in advance
How can inflectional paradigms be extracted from a raw corpus using machine learning or other techniques?
I am looking for a dataset that contains a list of phrases and the emotion associated with each. For example, x="what the hell just happened", y='surprise',
and x="no one loves me", y='sad', etc.
This is somewhat urgent.
Thank you.
The features look like the attached image file. How can I convert this text file into ARFF format for use in Weka?
Are there any tools that do that?
Are there any techniques?
What are the steps to separate each feature into its own column?
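In case it helps, ARFF is plain text, so a short script can do the conversion once each feature's name and type are known. A sketch (the relation and attribute names below are made up; substitute your own columns):

```python
def to_arff(relation, attributes, rows):
    """Render numeric/nominal features as a Weka ARFF string.

    `attributes` is a list of (name, type) pairs, where type is either
    'numeric' or a list of nominal values, e.g. ['spam', 'ham'].
    """
    lines = [f"@relation {relation}", ""]
    for name, typ in attributes:
        if isinstance(typ, list):
            typ = "{" + ",".join(typ) + "}"
        lines.append(f"@attribute {name} {typ}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines)

attrs = [("length", "numeric"), ("label", ["spam", "ham"])]
rows = [(12, "spam"), (7, "ham")]
print(to_arff("messages", attrs, rows))
```

If the source file is tab-delimited, splitting each line on tabs gives the per-column features before rendering; Weka itself can also load CSV directly and save as ARFF.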
What is currently the best way to measure the text similarity between two documents based on word2vec word embeddings?
We used word2vec to create word embeddings (vector representations of words).
Now we want to use these word embeddings to measure the text similarity between two documents.
Which technique is currently the best for calculating text similarity using word embeddings?
Thanks.
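A common simple baseline is to average the word vectors of each document and compare the averages with cosine similarity; stronger alternatives include TF-IDF-weighted averaging and Word Mover's Distance. A sketch with toy vectors standing in for the real word2vec ones:

```python
from math import sqrt

def doc_vector(tokens, embeddings):
    """Average the word vectors of all in-vocabulary tokens."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy 2-d embeddings; real word2vec vectors are typically 100-300-d.
emb = {"king": [0.9, 0.1], "queen": [0.8, 0.2], "apple": [0.1, 0.9]}
d1 = doc_vector(["king", "queen"], emb)
d2 = doc_vector(["apple"], emb)
print(cosine(d1, d1))  # close to 1.0
print(cosine(d1, d2))
```

Averaging loses word order, which is why sentence-level encoders and Word Mover's Distance often score better; but the averaged-vector cosine remains the standard first thing to try.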
I am working on a project where I need to calculate the perplexity or cross-entropy of some text data. I have been using MITLM, but it does not seem to be very well documented. Can anyone suggest an alternative to MITLM? Thanks!
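KenLM and SRILM are widely used, better-documented n-gram toolkits that can serve as alternatives to MITLM. And if you already have per-token probabilities from some model, the cross-entropy and perplexity themselves are straightforward to compute; a sketch:

```python
from math import log2

def cross_entropy(probs):
    """Average negative log2 probability per token (bits per token)."""
    return -sum(log2(p) for p in probs) / len(probs)

def perplexity(probs):
    """Perplexity is 2 raised to the cross-entropy."""
    return 2 ** cross_entropy(probs)

# Per-token probabilities a language model assigned to some test text.
token_probs = [0.25, 0.25, 0.25, 0.25]
print(cross_entropy(token_probs))  # 2.0
print(perplexity(token_probs))     # 4.0
```

The toolkit's job is producing those per-token probabilities from a smoothed n-gram model; the reduction to a single perplexity number is just the two lines above.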
I have a small corpus in the software-projects domain and need vector representations of words. Word2Vec gives good results but requires a large corpus. I thought about using pre-trained models such as the Google News word2vec model, but that is a general-purpose model, not adapted to specific domains. So I need a tool or technique for deriving a model adapted to my domain. Is there any solution?
Statistical language models have been a major focus of NLP research for many years. But apart from these language models, what other types of models were, or are, used for NLP tasks?
I need some normative data on the semantic relations between categories. In particular, I need coordinate values for these categories in a semantic space. The best I can find on this topic so far is Cree & McRae (2003). They offer distance calculations between categories. However, I need the coordinate locations of each of the categories in the space they are projected onto.
For instance, the semantic difference between the categories herbivore and insect might be represented as a cosine of 5. However, I would need to know that the location of insect in the projected space is 3, and that of herbivore is 12.
Thanks!
Brandon
I am trying to develop software that assigns suitable attributes to entity names depending on the entity type.
For example, entities such as doctor, nurse, employee, customer, patient, lecturer, donor, user, developer, designer, driver, passenger and technician will all have attributes such as name, sex, date of birth, email address, home address and telephone number, because all of them are people.
As a second example, words such as university, college, hospital, hotel and supermarket can share attributes such as name, address and telephone number, because all of them could be organizations.
Are there any Natural Language Processing tools or software that could help me achieve my goal? I need to identify the entity type as person or organization, then attach suitable attributes according to that type.
I have looked at Named Entity Recognition (NER) tools such as the Stanford Named Entity Recognizer, which can extract entities such as Person, Location, Organization, Money, Time, Date and Percent, but it was not really useful.
I could do it by building my own gazetteer, but I would prefer not to take that route unless automatic approaches fail.
Any help, suggestions and ideas will be appreciated.
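One possible sketch of the lookup step: map each entity name up to a coarse type, then attach that type's attributes. The hypernym table below is a hard-coded stand-in for illustration only; in practice it could be replaced by WordNet hypernym chains (e.g. via NLTK), which is presumably the automatic route you are after:

```python
# Hypothetical hypernym table standing in for a resource like WordNet,
# mapping each entity name up to a coarse type.
HYPERNYMS = {
    "doctor": "person", "nurse": "person", "passenger": "person",
    "university": "organization", "hospital": "organization",
}

# Attributes shared by all entities of a given coarse type.
ATTRIBUTES = {
    "person": ["name", "sex", "date_of_birth", "email", "address", "phone"],
    "organization": ["name", "address", "phone"],
}

def attributes_for(entity):
    """Look up the coarse type of an entity name, then its attributes."""
    kind = HYPERNYMS.get(entity.lower())
    return ATTRIBUTES.get(kind, [])

print(attributes_for("Doctor"))
print(attributes_for("hospital"))
```

The hard part, as you note, is filling the type table automatically; WordNet's hypernym hierarchy (person.n.01 vs. organization.n.01 as ancestors) is one standard way to decide person vs. organization for common nouns like these.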
I am currently working on a natural language processing project in Python. I wonder whether I can use fuzzy-logic principles when making a probabilistic decision in NLP.
Hi there,
I am aiming to study multi-dimensional sustainability and well-being (SaW) as a unified subject matter. Instead of conducting a traditional literature review, I am interested in applying a more sophisticated method (e.g. LDA) to establish connections within SaW scientifically. Please share your thoughts on how practical it is to address this topic-modeling problem with LDA. If anybody is interested in collaborating on one of my papers, please let me know.
I am interested in solving different dynamic models, but I have not yet selected a platform. I thought of Python because it is free software, but other suggestions are welcome.
Moreover, is it possible to model a bilevel problem in Python?
How can I find a survey or literature review of the latest research on methods for representing words as continuous vectors in a continuous space?
I ask from the perspective of language modeling, as I want to build a statistical language model in continuous space.
Dear all,
I had an idea for a data structure that realises string-to-something map functionality but is not a usual (hash) map. After implementing it in Python, I have some first evidence that it is faster than a map; the sizes of the respective structures still have to be compared.
But I do not work on data structures and/or algorithms, and I do not even know whether something like this is already described in the scientific literature. Where can I find more information on this topic, or someone who works on it? Good keywords, or names of data structures/algorithms that realise similar functionality, would be enough.
Sincerely,
Daniel
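From the description (a string-to-value map that is not hash-based), useful keywords might be: trie (prefix tree), radix/Patricia tree, ternary search tree, DAWG, and succinct tries (e.g. marisa-trie). For comparison with your structure, a minimal trie-backed map looks like this:

```python
class Trie:
    """Minimal string-to-value map backed by a prefix tree."""

    def __init__(self):
        self.children = {}   # one child node per next character
        self.value = None
        self.present = False  # distinguishes stored None from "absent"

    def put(self, key, value):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.value, node.present = value, True

    def get(self, key, default=None):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return default
        return node.value if node.present else default

t = Trie()
t.put("cat", 1)
t.put("car", 2)
print(t.get("cat"), t.get("car"), t.get("ca"))  # 1 2 None
```

Tries trade hashing for per-character pointer chasing and support prefix queries that hash maps cannot; papers comparing them to hash maps usually appear under "string dictionaries" or "string indexing".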
I would like to know whether TF-IDF is domain independent. In more detail, I have a dataset containing documents about "disease outbreaks" and "smart technology". I then compute:
1. TF-IDF for all docs in the whole dataset, saving the value of each term.
2. TF-IDF only for the "smart technology" docs, saving the value of each term.
My question is: if I run a clustering or classification algorithm on the "smart technology" docs, does it affect the results whether I use the TF-IDF values from 1 or 2?
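One way to see why the choice matters: IDF is computed over whatever collection you feed in, so the same term gets different weights in the two setups. In the toy sketch below, "smart" occurs in every "smart technology" document, so its TF-IDF is zero within that subset but non-zero over the whole dataset; differences like this propagate into any downstream clustering or classification.

```python
from math import log
from collections import Counter

def tfidf(docs):
    """Per-document TF-IDF; the IDF term depends on the whole collection."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency of each term
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: tf[t] * log(n / df[t]) for t in tf})
    return scores

tech = [["smart", "sensor"], ["smart", "city"]]
disease = [["outbreak", "virus"]]

whole = tfidf(tech + disease)        # setup 1: whole dataset
tech_only = tfidf(tech)              # setup 2: "smart technology" docs only
print(tech_only[0]["smart"], whole[0]["smart"])
```

So TF-IDF is collection dependent, not domain independent: option 2 down-weights terms that are common within the subdomain, which is usually what you want when analyzing that subdomain on its own.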
Thanks.
The project is as follows:
To build a processing system that checks spelling and diction and identifies parts of speech. The system should compute a fog index for the manuscript and provide summary information such as average sentence length, number and percentage of compound sentences, use of commas, number of paragraphs, etc.
Once done, we'll attempt to simplify the language for young readers if its readability caters to a higher level of readers.
Constraints:
The project is to be done by a group of two final-year CS engineering students.
We know Java and can build our knowledge on it.
We can learn Prolog during December and the project has to be completed by April-May.
Please help us know whether it is feasible and wise to venture into this, as two amateurs making an effort to learn more each day.
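The core computations are well within reach of two students in Java or Python. For instance, the Gunning fog index is 0.4 × (average sentence length + percentage of "complex" words, i.e. words of three or more syllables). A sketch using a rough vowel-group syllable heuristic (a real system would use a pronunciation dictionary):

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels; real
    syllabification needs a pronunciation dictionary."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fog_index(text):
    """Gunning fog index: 0.4 * (avg sentence length + % complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_len = len(words) / len(sentences)
    pct_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_len + pct_complex)

sample = "The cat sat on the mat. It was a sunny day."
print(round(fog_index(sample), 2))  # 2.2
```

Average sentence length, comma counts and paragraph counts fall out of the same tokenization; the genuinely hard parts of the project are spell/diction checking and POS tagging, where an existing library is advisable.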
I need to generate a Penn Treebank-style treebank from a POS-tagged corpus.
Please let me know if there is any tool for this which is language independent.
I'm looking at this problem of trying to identify users asking a question we already have an answer for, in a way that we are not accounting for. For example, I might have a response for a question like "what is the population of brazil?", but the user might ask "how many people live in brazil?" or "what's the number of inhabitants in brazil?". I would like to be able to classify those questions as equivalent.
Any suggestions on some papers, research material I should look into, or maybe hints on what things I should try?
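Useful search keywords: paraphrase identification, duplicate-question detection (e.g. the Quora Question Pairs dataset) and semantic textual similarity. A purely lexical baseline such as Jaccard token overlap shows why this is hard; in the sketch below, the true paraphrase actually scores lower than an unrelated question that shares function words, which is exactly why embedding-based sentence representations help:

```python
def jaccard(a, b):
    """Token-overlap baseline for question equivalence."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

q1 = "what is the population of brazil"
q2 = "how many people live in brazil"       # paraphrase of q1
q3 = "what is the capital of france"        # unrelated question
print(round(jaccard(q1, q2), 2))
print(round(jaccard(q1, q3), 2))
```

A practical system typically encodes both questions into vectors (averaged word embeddings or a sentence encoder) and thresholds their cosine similarity, or trains a classifier on labeled duplicate pairs.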
I am looking for a tool that is source-language independent and can generate a PCFG from a plain or annotated corpus.
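If no off-the-shelf tool fits, PCFG estimation from an annotated corpus is itself language independent and simple: read the productions off each tree and normalize their counts per left-hand side (relative-frequency / maximum-likelihood estimation). A sketch over hand-written productions:

```python
from collections import Counter, defaultdict

def pcfg_from_productions(productions):
    """Estimate rule probabilities by relative frequency.

    `productions` is a list of (lhs, rhs) pairs read off treebank trees,
    e.g. ("S", ("NP", "VP")).
    """
    counts = Counter(productions)
    lhs_totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        lhs_totals[lhs] += c
    # P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}

prods = [
    ("S", ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("NP", ("NNP",)),
    ("NP", ("DT", "NN")),
]
pcfg = pcfg_from_productions(prods)
print(pcfg[("NP", ("DT", "NN"))])  # 2/3
```

Inducing a grammar from a plain (unannotated) corpus is a much harder, separate problem (unsupervised grammar induction); from an annotated corpus it reduces to the counting above.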