# Natural Language Processing - Science topic

Natural Language Processing is the computer processing of a language with rules that reflect and describe current usage rather than prescribed usage.
Questions related to Natural Language Processing
• asked a question related to Natural Language Processing
Question
I am trying to make generalizations about which layers to freeze. I know that I must freeze feature extraction layers, but some feature extraction layers should not be frozen (for example, in the transformer architecture, the encoder and the multi-head attention part of the decoder are feature extraction layers that should not be frozen). Which layers should I call “feature extraction layers” in that sense? What kind of “feature extraction” layers should I freeze?
No problem, Muhammedcan Pirinççi, I am glad it helped you.
In my humble opinion, first, we should consider the difference between transfer learning and fine-tuning and then decide which one better fits our problem. In this regard, I found this link very informative and useful: https://stats.stackexchange.com/questions/343763/fine-tuning-vs-transferlearning-vs-learning-from-scratch#:~:text=Transfer%20learning%20is%20when%20a,the%20model%20with%20a%20dataset.
Afterward, when you decide which approach to use, there are tons of built-in functions and frameworks to do this for you. I am not sure whether I understood your question completely; however, I tried to address it a little. If anything is still vague, please don't hesitate to ask.
Regards
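To make the mechanics of freezing concrete: in PyTorch-style code, freezing a layer means setting `requires_grad = False` on its parameters, typically selected by name prefix. Below is a dependency-free sketch of that pattern; the `Param` class is a stand-in for a real framework parameter object, and the parameter names are purely illustrative.

```python
class Param:
    """Stand-in for a framework parameter object (e.g. torch.nn.Parameter)."""
    def __init__(self):
        self.requires_grad = True

def freeze_by_prefix(named_params, prefixes):
    """Disable gradients for every parameter whose name starts with one of the prefixes."""
    for name, p in named_params.items():
        if any(name.startswith(pre) for pre in prefixes):
            p.requires_grad = False

# Illustrative parameter names: freeze the encoder, keep the task head trainable.
params = {
    "encoder.layer.0.attention.weight": Param(),
    "encoder.layer.1.attention.weight": Param(),
    "classifier.weight": Param(),
}
freeze_by_prefix(params, prefixes=["encoder."])
```

With a real transformers model the same loop would run over `model.named_parameters()`; which prefixes to freeze is exactly the judgment call the question is about.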
• asked a question related to Natural Language Processing
Question
Hello, I am interested in the task of converting word numerals to numbers, e.g.
- 'twenty two' -> 22
- 'hundred five fifteen eleven' -> 105 1511 etc.
And the problem I can't understand at all currently is that for a number like 1234567890 there are many ways we can write it in words:
=> 12-34-56-78-90 is 'twelve thirty four fifty six seventy eight ninety'
=> 12-34-567-890 is 'twelve thirty four five hundred sixty seven eight hundred ninety'
=> 123-456-78-90 is '(one)hundred twenty three four hundred fifty six seventy eight ninety'
=> 12-345-678-90 is 'twelve three hundred forty five six hundred seventy eight ninety'
and so on (Here I'm using dash for indicating that 1234567890 is said in a few parts).
Hence, all of the above words should be converted into 1234567890.
I am reading the following papers in the hope of tackling this task:
But so far I still can't understand how one would go about solving it.
Thank you
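The grouping idea in the question can be prototyped with a small chunk parser: repeatedly consume one spoken chunk in the range 1-999 and concatenate the chunks' digits. Here is a rough sketch under simplifying assumptions: it ignores 'thousand'/'million', 'zero', and the ambiguity between alternative groupings (it commits greedily to one parse).

```python
UNITS = {w: n for n, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 20 + 10 * n for n, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}

def next_chunk(tokens, i):
    """Consume one spoken chunk (1-999) starting at token index i."""
    val = 0
    if tokens[i] in UNITS and i + 1 < len(tokens) and tokens[i + 1] == "hundred":
        val, i = UNITS[tokens[i]] * 100, i + 2       # "five hundred"
    elif tokens[i] == "hundred":
        val, i = 100, i + 1                          # bare "(one) hundred"
    if i < len(tokens) and tokens[i] in TENS:
        val += TENS[tokens[i]]
        i += 1
        if i < len(tokens) and tokens[i] in UNITS and UNITS[tokens[i]] < 10:
            val += UNITS[tokens[i]]                  # "seventy eight"
            i += 1
    elif i < len(tokens) and tokens[i] in UNITS and not (
            i + 1 < len(tokens) and tokens[i + 1] == "hundred"):
        val += UNITS[tokens[i]]                      # teen or lone unit
        i += 1
    if val == 0:
        raise ValueError(f"cannot parse chunk at {tokens[i]!r}")
    return val, i

def words_to_number(text):
    """Concatenate the digits of successive spoken chunks."""
    tokens, digits, i = text.lower().split(), "", 0
    while i < len(tokens):
        val, i = next_chunk(tokens, i)
        digits += str(val)
    return int(digits)
```

For example, `words_to_number("twelve thirty four fifty six seventy eight ninety")` and `words_to_number("hundred twenty three four hundred fifty six seventy eight ninety")` both yield 1234567890; a full solution would still need to handle chunks that are themselves ambiguous.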
• asked a question related to Natural Language Processing
Question
Natural Language Processing
Sketch Engine is quite robust
• asked a question related to Natural Language Processing
Question
I know some basic approaches that can be used on languages with rich morphology.
1. Stemming
2. Lemmatizing
3. Character n-grams
4. FastText embeddings
5. Sentencepiece
I would like to know if there are any more recent developments, and what researchers feel about the robustness of each method in specific domains (Indic languages, etc.).
Hi,
here is a link to an old paper of mine. It discusses the pros and cons of different approaches up to 2010 or so.
Br, Kimmo
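As a concrete illustration of approach 3 in the question, FastText-style character n-grams wrap the word in boundary markers and slide a window over it. A minimal sketch (the n-gram size is a free parameter; FastText actually uses a range of sizes):

```python
def char_ngrams(word, n=3):
    """FastText-style character n-grams with boundary markers '<' and '>'."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Morphologically related word forms share most of their n-grams,
# which is what makes subword features robust for rich morphology.
print(char_ngrams("paper"))   # ['<pa', 'pap', 'ape', 'per', 'er>']
```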
• asked a question related to Natural Language Processing
Question
I'm looking for datasets containing coherent sets of tweets related to Covid-19 (for example, collected within a certain time period according to certain keywords or hashtags), with labels indicating whether they contain fake or real news, or whether they contain pro-vax or anti-vax information. Ideally, the dataset would also contain a column showing the textual content of each tweet, a column showing the date, and columns showing 1) the username/id of the author; 2) the usernames/ids of the people who retweeted the tweet.
Do you know any dataset with these features?
@Luigi Arminio,
The attached screenshot may be useful. Thanks ~PB
• asked a question related to Natural Language Processing
Question
I have a collection of sentences that are in an incorrect order. The system should output the correct order of the sentences. What would be an appropriate approach to this problem? Is it a good approach to embed each sentence into a vector and classify each sentence's position using multiclass classification (assuming the length of the collection is fixed)?
Please let me know if there can be other approaches.
Something you could do is identify linguistic rules that suggest a certain order. For example, before a personal pronoun can be used, a distinct name must be introduced, and the gender must agree with it.
He gave her an envelope.
She went to Peter.
Mary entered the room.
A knowledge base must provide information about actions that are done by objects of a certain type, e.g. that giving is an action performed by humans (and not by rooms), and information about the gender of names.
Regards,
Joachim
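As a complement to the fixed-length multiclass idea in the question: one can instead train a pairwise "comes before" model and sort with it, which also handles variable-length collections. A sketch of the sorting step only; `comes_before` stands for any trained comparator, and the toy comparator below (a leading index) is purely illustrative.

```python
from functools import cmp_to_key

def order_sentences(sentences, comes_before):
    """Sort sentences using a pairwise precedence predicate."""
    return sorted(sentences,
                  key=cmp_to_key(lambda a, b: -1 if comes_before(a, b) else 1))

# Toy comparator: pretend a trained model scored narrative order via a leading index.
toy = lambda a, b: int(a.split(":")[0]) < int(b.split(":")[0])
shuffled = ["3: He gave her an envelope.",
            "2: She went to Peter.",
            "1: Mary entered the room."]
ordered = order_sentences(shuffled, toy)
```

In practice the comparator's pairwise decisions may be inconsistent, so a robust system aggregates them (e.g. by topological sorting or ranking) rather than trusting a single comparison.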
• asked a question related to Natural Language Processing
Question
I have a set of tags per document, and want to create a tree structure of the tags, for example:
Tags:
- Student,
- Instructor,
- Student_profile,
- The_C_Programming_Language_(2nd Edition),
I need to generate a hierarchy as per the attached example image.
Are there free taxonomies/ontologies which can give parent words? For example:
get_parent_word( "Student", "Instructor") = 'People'
is_correct_parent(parent: "Student", child: "Student_profile") = True
I have a corpus of English as well as technical documents and use Python as the main language. I am currently exploring WordNet and the SUMO ontology; if anyone has used them previously for a similar task, or if you know something better, I would really appreciate your guidance.
Bahadorreza Ofoghi , thanks for sharing, it looks interesting.
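The two lookups sketched in the question can be prototyped over any taxonomy stored as a child-to-parent map; with NLTK's WordNet, hypernym chains and `lowest_common_hypernyms` would play the same role. A dependency-free sketch over a toy taxonomy (the map below is invented purely for illustration):

```python
# Toy child -> parent map; a real system would derive this from WordNet/SUMO.
TAXONOMY = {
    "People": None,
    "Student": "People",
    "Instructor": "People",
    "Student_profile": "Student",
    "Books": None,
    "The_C_Programming_Language_(2nd_Edition)": "Books",
}

def ancestors(tag):
    """Chain from tag up to the root, including the tag itself."""
    chain = []
    while tag is not None:
        chain.append(tag)
        tag = TAXONOMY.get(tag)
    return chain

def get_parent_word(a, b):
    """Lowest common ancestor of two tags (excluding the tags themselves)."""
    in_b = set(ancestors(b))
    for t in ancestors(a)[1:]:
        if t in in_b:
            return t
    return None

def is_correct_parent(parent, child):
    """True if parent is an ancestor of child in the taxonomy."""
    return parent in ancestors(child)[1:]
```

`get_parent_word("Student", "Instructor")` then returns `"People"`, matching the desired API from the question.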
• asked a question related to Natural Language Processing
Question
I have been investigating research topics about code-mixing for some downstream tasks in NLP. More exactly, it is a bit hard to find a code-mixed corpus for the cross-lingual sentence-retrieval task.
• asked a question related to Natural Language Processing
Question
Hello everyone,
I am looking for a corpus-building repository for the domain of Sentiment Analysis for the Bangla/Bengali language.
• asked a question related to Natural Language Processing
Question
Self-supervised learning: in which domain (NLP, Computer Vision, or Speech) is it used the most?
You're welcome Titas De
• asked a question related to Natural Language Processing
Question
Hello everyone,
Could you recommend courses, papers, books or websites about wav audio preprocessing?
Thank you for your attention and valuable support.
Regards,
Cecilia-Irene Loeza-Mejía
There are many software tools you can use to process WAV files, e.g. Audacity and MATLAB. Maybe it's better if you write which features you want to extract from the WAV files. That way we'll be able to help you with more precision.
Blessings from Brazil,
Fernando
• asked a question related to Natural Language Processing
Question
Hello!
I'm writing a systematic review article on Natural Language Processing (NLP) and planning to submit the paper to a Q1 journal. Would you please recommend a list of free Q1 journals from which I can expect a fast decision?
Thank you so much everyone for the kind cooperation.
• asked a question related to Natural Language Processing
Question
Hello everyone
I am looking for a repository database for the domain of Sentiment Analysis for the Arabic language.
Hi Hicham
Please have a look at the following GitHub repo:
I hope this helps.
Good luck!
• asked a question related to Natural Language Processing
Question
Greetings, I am very enthusiastic about Natural Language Processing. I have some experience with Machine learning, Deep learning and Natural Language Processing. Is there anyone who is willing to work in collaboration?
Kindly ping me. Regards and thanks.
• asked a question related to Natural Language Processing
Question
I am trying to implement a VQA model in e-commerce, and would love to have a dataset that focuses on fashion (or any e-commerce type of goods). If there isn't an available one, is synthetically generating Q&A pairs for a given image a good idea? If so, any idea how to approach such a problem?
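Synthetic generation is a common fallback when no labeled VQA data exists: pair each product image's catalog attributes with question templates. A minimal sketch of the template step, where the attribute names and templates are invented for illustration:

```python
# Hypothetical question templates keyed by catalog attribute.
TEMPLATES = {
    "color": "What color is the {name}?",
    "brand": "Which brand makes the {name}?",
    "material": "What material is the {name} made of?",
}

def generate_qa(product):
    """Turn one product's attribute dict into (question, answer) pairs."""
    return [(TEMPLATES[attr].format(name=product["name"]), value)
            for attr, value in product.items()
            if attr in TEMPLATES]

pairs = generate_qa({"name": "summer dress", "color": "red", "brand": "Acme"})
```

Each generated pair would then be attached to the product's image; paraphrasing the templates (manually or with a language model) helps avoid the VQA model overfitting to fixed question wording.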
• asked a question related to Natural Language Processing
Question
I have a dataset that contains a text field for more than 3,000 records, all of which contain notes from a doctor. I need to extract specific information from each of them, for example the doctor's final decision and the classification of the patient. What is the most appropriate way to analyze these texts? Should I use information retrieval or information extraction, or would a Q&A system be fine?
Dear Matiam Essa,
This text mining approach focuses on extracting entities, attributes, and their relationships from semi-structured or unstructured texts. Whatever information is extracted is then stored in a database for future access and retrieval. The well-known techniques are:
Information Extraction (IE)
Information Retrieval (IR)
Natural Language Processing
Clustering
Categorization
Visualization
With the increasing amount of text data, effective techniques need to be employed to examine the data and extract relevant information from it. Various text mining techniques are used to efficiently decipher interesting information from multiple sources of textual data and to continually improve the text mining process.
Good luck!
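As a minimal illustration of Information Extraction on such notes, a rule-based pass with regular expressions can already pull out fields that follow a predictable phrasing. The 'Final decision:' pattern below is a hypothetical example of such phrasing; real notes would need a pattern per field and a fallback (e.g. a clinical NER model) for free-form text.

```python
import re

def extract_field(note, field):
    """Pull the text after 'Field:' on one line of a free-text note."""
    m = re.search(rf"(?im)^{re.escape(field)}\s*:\s*(.+)$", note)
    return m.group(1).strip() if m else None

note = "Patient reports mild pain.\nFinal decision: discharge with follow-up"
decision = extract_field(note, "Final decision")
```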
• asked a question related to Natural Language Processing
Question
Is there any AI-related journal (mainly NLP, Computer Vision, or Reinforcement Learning based) where I can submit short papers? It should be non-open access.
You may check it:
Artificial Intelligence An International Journal - Elsevier
• asked a question related to Natural Language Processing
Question
In which application of Machine Learning (NLP, Computer Vision, etc.) would we find the most value in Semi-Supervised Learning and Self-Training?
• asked a question related to Natural Language Processing
Question
I am trying to build a model that can produce speech for any given text.
I could not find any speech-cloning algorithm that can clone a voice based on speech alone, so I turned to TTS (text-to-speech) models. I have the following doubts regarding data preparation:
As per the LJSpeech dataset, which has many 3-10 second recordings, we require around 20 hours of data. It will be very hard for me to produce that many 10-second recordings. What would be the impact if I made many 5-minute recordings instead? One impact could be higher resource requirements (but how much?); are there any others?
Also, is there some way I could convert these 5-minute recordings to the LJSpeech format?
• asked a question related to Natural Language Processing
Question
Hi everybody,
I would like to do part of speech tagging in an unsupervised manner, what are the potential solutions?
Dear Dr. Fatemeh Daneshfar, see the following useful RG link:
• asked a question related to Natural Language Processing
Question
Looking for any work on structured prediction.
• asked a question related to Natural Language Processing
Question
What are the latest advances in zero-shot learning on NLP ?
• asked a question related to Natural Language Processing
Question
Consider a record of 100 values with different errors in the data, such as NULLs, duplicate values, or improper formats. Is it possible to cluster those data by error type and display the reason for each using NLP?
• asked a question related to Natural Language Processing
Question
I developed an approach for extracting aspects from reviews in different domains, and now I have the aspects. I want some suggestions on how to use these aspects in different applications or tasks, such as an aspect-based recommender system.
Note: Aspect usually refers to a concept that represents a topic of an item in a specific domain, such as price, taste, service, and cleanliness which are relevant aspects for the restaurant domain.
• asked a question related to Natural Language Processing
Question
I'm researching autoencoders and their application to machine learning problems. But I have a fundamental question.
As we all know, there are various types of autoencoders, such as the Stacked Autoencoder, Sparse Autoencoder, Denoising Autoencoder, Adversarial Autoencoder, Convolutional Autoencoder, Semi-Autoencoder, Dual Autoencoder, Contractive Autoencoder, and others that are improved versions of what we had before. Autoencoders are also known to be used in Graph Networks (GN), Recommender Systems (RS), Natural Language Processing (NLP), and Computer Vision (CV). This is my main concern:
Because the input and structure of each of these machine learning problems are different, which version of the autoencoder is appropriate for which machine learning problem?
Regards,
Shafagat
• asked a question related to Natural Language Processing
Question
Hi,
I studied finance in my master's and worked in financial institutions. I have worked on automation of risk and compliance. I am currently planning to pursue a PhD connected to Artificial Intelligence. Based on reading some articles online, I have come up with a list of PhD topics.
Could you please help me find which one from this list is best? Any other new idea is also welcome. Thank you.
1. Cost Benefit analysis of Implementing AI in GRC (Governance, Risk, and compliance) of Financial Institutions
2. ROI of Implementing AI in GRC
3. Application of AI in Automation, Data Validation, Cleansing
4. Application of Natural Language Processing in GRC for Categorization and Mapping
5. Approach to implementing AI: whole transformation vs. hybrid adoption
6. Benefits and challenges for financial institutions that are early adopters of AI
7. Role of AI in reducing behavioral biases in Risk Management
8. AI-based Entrepreneurship and Innovation
9. AI in Risk management of Hedge Funds
The best way is to start with a comprehensive literature review, understanding what has been done and what needs to be done, and why, along with the expected benefits; then work out how you will do it within a certain time frame.
Good Luck
• asked a question related to Natural Language Processing
Question
I wanted to know the applicability of DL language technologies as applications of NLP
Aklilu Elias Deep Learning and Natural Language Processing are both subsets of the greater area of Artificial Intelligence. While NLP is reinventing how robots comprehend human language and behavior, Deep Learning is expanding the scope of NLP applications.
• asked a question related to Natural Language Processing
Question
I have some Key Informant Interview (KII) data. I want to apply Natural Language Processing (NLP) to identify patterns in the data. Can applying NLP to analyze KII be mentioned as a data analytics tool in the report/paper? TIA
Of course, it is interesting work. For example: (1) use NER (Named Entity Recognition) and RE (Relation Extraction) to construct a knowledge graph, then analyze the relations between the interviewees or the knowledge constitution of an interviewee; (2) use EE (Event Extraction) to identify the event correlation between the questions and answers; (3) use SA (Sentiment Analysis) to analyze the attitudes toward the interviewer or the company, etc.; (4) use topic models to analyze the topics of the interview and find out which topics the interviewers are most interested in; etc.
There are many, many interesting tasks you can do with NLP analysis. I hope you finish an interesting paper soon.
• asked a question related to Natural Language Processing
Question
Application of Natural Language Processing (NLP) and Text-Mining of Big-Data to Engineering-Procurement-Construction (EPC) Bid and Contract Documents
Have a quick look at these articles and links:
Kind Regards
Qamar Ul Islam
• asked a question related to Natural Language Processing
Question
I am writing a thesis in which I'm trying to see what the relationship is between how innovative a patent is (as determined by a series of NLP-based measures) and the financial value of that patent (as determined by a different measure, in millions of dollars). My dataset contains around 2 million US patents.
I'm basing my work on:
The text-based innovation measures developed by my professor (with NLP, count data).
E.g. new_word counts how many times the focal patent used a word that was not found in any patent published prior to the focal patent. new_word_reuse counts the number of subsequent patents that reused the new word introduced by the focal patent.
A financial value measure developed by a different researcher (continuous, in millions of dollars).
By linking these two I now have a dataset with 11 text-based innovation measures (count data), 2 financial value measures (one in real dollars deflated by CPI, one in nominal value) and a variable containing the number of times each patent has been cited in different patents (cites, also count).
My training is not in statistics (though I did take two stats courses in the past two years), so I'm having some trouble finding the correct model/fit. All the other regressions I've done in the past were simple OLS regressions with relatively little data transformation needed.
The dataset I'm working with now is something else. Many of the variables have a high number of patents with 0 as values. Nearly all of them are quite skewed.
I've tried many different types of regression and data transformation, and I keep getting significant parameters yet an extremely low R² (I'm assuming because of the large sample size).
I'm not sure what my next steps are to diagnose the issues and find a correct specification.
I've added pictures of summary statistics for all variables and a picture of a regression with all variables included (just to show everything) and one where I left out some variables that were insignificant or heavily correlated to other variables.
I probably made a lot of newbie mistakes so apologies, I'm trying to catch up to all the stats I used to have.
Any push in the right direction would be much appreciated!
Thanks!
Hello Mo,
if the count is the intended outcome, a hurdle model will be the best choice IMHO.
BTW: As I am also interested in patents, can you share some resources for the NLP patent analysis?
All the best,
Holger
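For readers who have not met the hurdle model Holger mentions: it splits the outcome into a binary zero/non-zero part and a zero-truncated count part, which suits count data with many zeros. A small sketch of the hurdle-Poisson probability mass function (parameter names are illustrative):

```python
import math

def hurdle_poisson_pmf(k, p_zero, lam):
    """P(Y = k) under a hurdle-Poisson model.

    p_zero : probability of the zero outcome (the 'hurdle' part)
    lam    : rate of the Poisson used for positive counts,
             renormalized to exclude zero (zero-truncated).
    """
    if k == 0:
        return p_zero
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return (1.0 - p_zero) * poisson / (1.0 - math.exp(-lam))
```

Fitting then amounts to a logistic model for the zero part plus a zero-truncated count model for the positives; in practice, packages such as `pscl` in R provide this directly.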
• asked a question related to Natural Language Processing
Question
I am doing a project on automated classification of software requirements using NLP and a machine learning approach, i.e. Naive Bayes. For this I require a dataset of classified software requirements. I have searched the PROMISE data repository but did not find a dataset matching my need. Can someone tell me where I can find and download such a dataset? It would be highly appreciated.
The PROMISE dataset is here: https://doi.org/10.5281/zenodo.268542
The PURE dataset is here: https://doi.org/10.5281/zenodo.1414117
• asked a question related to Natural Language Processing
Question
Hi
I am planning to start a PhD but I do not have an idea yet. I would prefer an idea that does not need a lot of programming. I have studied the basics of cloud computing and IoT; as for NLP, I have not studied it before. If you think there is any idea that might help me, please share it.
Also, what should I read in order to find a gap in the knowledge in any area?
Thank you Md Shofiqul Islam , Anupama Sakhare Arturo Geigel for your valuable comments
• asked a question related to Natural Language Processing
Question
I am looking for a public dataset of sexual text:
Like:
"Hey baby it’s saying the money u sent is on hold ..."
Words like baby, darling, asking for phone number?
Definition:
3: Intercourse, masturbation, porn, sex toys and genitalia
2: Sexual intent, nudity and lingerie
1: Informational statements that are sexual in nature, affectionate activities (kissing, hugging, etc.), flirting, pet names, relationship status, sexual insults and rejecting sexual advances
0: the text does not contain any of the above
The data that I have found till now is:
• Jigsaw Unbiased
• Pornhub dataset
Apart from these are there any other text data-set? Paper/models are well appreciated.
Hi Pratik Kumar Chhapolika, besides the datasets you already mention, I do not know of anything else useful.
However, I came across a few quite interesting papers that might add to your research:
1) This oddly philosophical view on pornographic language and its influence on society:
2) Quite a number of papers on the measurement and neutralization of sexually explicit language published by George Weir, which can be downloaded directly from this list:
Some of these papers mention labelled datasets, so it may be worth contacting Prof. Weir and asking if he might share that labelled data.
Anyways, good luck with your project!
• asked a question related to Natural Language Processing
Question
Can anyone please suggest an available source/link for Tamil handwritten sentence datasets?
Have a look:
Kind Regards
Qamar Ul Islam
• asked a question related to Natural Language Processing
Question
I'm looking for a PhD topic to get started. I'm interested in integrating the healthcare domain (biometric analysis or disease prediction) with machine learning/deep learning or Natural Language Processing (NLP); any suggested topics or papers are highly appreciated.
Since you (i) work with prediction of the epileptic seizure based on the analysis of ECG and EEG signals, and (ii) have asked questions about body sensor networks, then you could apply the techniques mentioned by @Qamar_Ul_Islam to solve these issues.
It is also helpful to discuss the topics with your advisors. Since you mentioned a PhD topic, you have probably already worked on or published something about these topics.
• asked a question related to Natural Language Processing
Question
Hello everyone. I am currently trying to work on Relation Extraction (RE) and Named Entity Recognition (NER). I'm looking for models and code to extract relations from large documents. I know only these two models; however, their code is not complete in terms of visualization of relations and entities.
SpERT: Span-based Joint Entity and Relation Extraction with Transformer Pre-training.
CMAN: Dynamic Cross-modal Attention Network.
I'm wondering if anyone has worked with these models for large amounts of social or scientific data. Or even knows of a better model.
Thanks!
Relation Extraction (RE) is the task of extracting semantic relationships from text, which usually occur between two or more entities. This field is used for a variety of NLP tasks such as creating Knowledge Graphs, Question-Answering Systems, Text Summarization, and so on.
These articles seem to be useful; have a look:
Kind Regards
Qamar Ul Islam
• asked a question related to Natural Language Processing
Question
I have some raw text data, which is pretty "dirty". My goal is to evaluate whether proper data cleaning can improve the accuracy of an NLP classification model. The data is first split into training and testing sets; it is natural to train two models, one on the cleaned training set and one on the raw training set. The question is whether both models should be tested on the raw test set, or on a test set cleaned by the same method as the training set.
Hello!
Your goal is: "to evaluate whether proper data cleaning can improve the accuracy of an NLP classification model"
To my understanding, you want to evaluate the classification. That means you need two models, one trained on raw data and one trained on cleaned data. But to compare the two models you need to use the very same raw test data. It will show how each model classifies when it faces dirty data.
If you want to evaluate the cleaning process itself, then you need to test on cleaned test data. It will show how necessary it is to clean your data before doing the classification.
• asked a question related to Natural Language Processing
Question
I am currently studying QA systems in NLP. I found that the terms MRC and QA system are used interchangeably. Then I found this page https://www.quora.com/What-is-the-difference-between-machine-comprehension-and-question-answering-in-NLP which states that MRC is one approach to solving the QA problem. What is another approach?
QA systems can be roughly divided into:
1. Query processing
2. Database storage (e.g. graph DBs, unstructured documents, SQL, etc., covering the processing for storage and retrieval of such information): architecture and methods of retrieval
3. Document processing(e.g. extracting the relevant piece of information after it has been extracted from the storage)
4. Answer formatting (e.g. frame filling, etc.)
Each stage can include multiple approaches. Could you be more specific about which part of the QA system you are interested in?
• asked a question related to Natural Language Processing
Question
Hello,
I read some papers that used pre-processing steps on text that is to be classified for sentiment analysis.
My question is, can I use text pre-processing techniques such as stop-word removal and stemming in sentiment analysis classification? If I do, they can cause negation words or negation prefixes to be deleted: for example, "I am not happy" becomes "happy" after stop-word removal, and "unlucky" becomes "lucky" after affix stripping. That means a sentence that should be classified into the negative class will be classified into the positive class. How do I deal with that?
Check out the following link, it might be useful for you:
Why is removing stop words not always a good idea
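One standard workaround for the problem raised in the question: exclude negation words from the stop-word list and explicitly mark the tokens inside a negation's scope, so that "not happy" survives as a distinct feature instead of collapsing to "happy". A minimal sketch, where the word lists are illustrative rather than a recommended set:

```python
NEGATIONS = {"not", "no", "never", "n't", "nor"}
STOP_WORDS = {"i", "am", "is", "the", "a", "an"} - NEGATIONS  # never drop negations
END_OF_SCOPE = {".", ",", ";", "!", "?"}

def preprocess(tokens):
    """Drop stop words and punctuation, keep negations, and mark negated tokens."""
    out, negated = [], False
    for tok in tokens:
        low = tok.lower()
        if low in END_OF_SCOPE:
            negated = False          # punctuation closes the negation scope
        elif low in NEGATIONS:
            negated = True
            out.append(tok)
        elif low not in STOP_WORDS:
            out.append("NOT_" + tok if negated else tok)
    return out
```

For example, `preprocess("i am not happy".split())` yields `["not", "NOT_happy"]`, so the classifier still sees the negated sentiment.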
• asked a question related to Natural Language Processing
Question
Machine learning is a subfield of Artificial Intelligence (AI) that deals with teaching computers to make decisions. One of its main research areas is Natural Language Processing, which has numerous applications in various fields.
This post is dedicated to all researchers interested in NLP, whether they are actually working on research topics related to it or want to enter the domain.
I would appreciate it if you could share with us your thoughts, research topics, and experience in working with NLP methods and techniques.
• asked a question related to Natural Language Processing
Question
Hey everyone
I am a newbie in this domain and am looking for a research topic in the domain of NLP. Can anyone help me find a new research topic?
• asked a question related to Natural Language Processing
Question
I have studied the attention mechanism and have seen its application in Natural Language Processing (NLP) and Computer Vision (CV). NLP is not my area of interest. CV is my area of interest, but the attention mechanism is commonly applied in image captioning, which is a part of information retrieval. Image captioning deals with long subsequences of text to convert unstructured image-related data into structured data. It does not relate directly to images but extracts text-based information from them. I want to use the attention mechanism directly on images, through its integration with CNNs, especially for better feature extraction, selection, and classification. Suggestions would be of great help.
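The integration asked about is usually an attention module inserted between convolutional blocks: attention weights over the spatial positions of a feature map re-weight the features before pooling and classification. Below is a NumPy sketch of self-attention over the H×W positions of a single feature map; the shapes are illustrative, and real modules (e.g. in SENet or CBAM variants) add learned projections rather than using the raw features as queries and keys.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(feat):
    """feat: (H, W, C) feature map from a CNN block.

    Each spatial position attends to every other position; the output is a
    weighted mix of features with the same shape as the input, so the module
    can be dropped between convolutional blocks.
    """
    H, W, C = feat.shape
    flat = feat.reshape(H * W, C)
    scores = flat @ flat.T / np.sqrt(C)     # (H*W, H*W) position similarities
    weights = softmax(scores, axis=-1)      # each row sums to 1
    return (weights @ flat).reshape(H, W, C)

out = spatial_self_attention(np.random.rand(7, 7, 16))
```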
• asked a question related to Natural Language Processing
Question
We are an active Arabic Natural Language processing (NLP) and AI research group doing research in Deep learning, machine learning and social network analysis for Arabic NLP.
We are looking for an RA who can work remotely on a number of NLP/deep learning/machine learning projects. Where can we find such candidates?
Responsibilities:
Data cleaning, analysis and visualization using various approaches.
Ability to conduct literature review and summarize them in a coherent way.
Ability to implement different ML/DL approaches using different datasets to serve specific NLP problems.
Ability to fine-tune BERT/AraBERT and their different variations to serve specific NLP tasks.
Ability to communicate the experiments and results in clear English language.
Required Minimum Qualifications:
Master/PhD in computer science.
Experience in Python (including NumPy, SciPy, pandas, matplotlib)
Excellent working knowledge of Deep learning/Machine Learning.
Experience with word embeddings, BERT, etc.
Ability to clearly communicate technical ideas in English.
Motivated, independent self-learner with the ability to work in a diverse team.
Excellent verbal and written communication skills are required.
That is interesting. I am fully occupied, but I will share this with my colleagues and students if you are still looking.
• asked a question related to Natural Language Processing
Question
I want to perform sentiment and context analysis for a literature review project. Can you please suggest appropriate tools which can be used for the same?
Toloka, NLTK
• asked a question related to Natural Language Processing
Question
How do you interpret different NLP concepts for time series? For example, "self-attention" and "positional embeddings" in transformers.
There are numerous benefits to utilizing the Transformer architecture over LSTM RNN. The two chief differences between the Transformer Architecture and the LSTM architecture are in the elimination of recurrence, thus decreasing complexity, and the enabling of parallelization, thus improving efficiency in computation.
Kind Regards
Qamar Ul Islam
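To make "positional embeddings" concrete for time series: because self-attention is order-invariant, each time step gets a deterministic position vector added to its features. A NumPy sketch of the sinusoidal encoding from the original Transformer paper; the sequence length and model dimension below are illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position matrix of shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]        # time-step index
    i = np.arange(d_model)[None, :]          # feature-dimension index
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    # Even dimensions use sine, odd dimensions use cosine.
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(seq_len=50, d_model=8)
# pe would be added to a (50, 8) time-series feature matrix before attention.
```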
• asked a question related to Natural Language Processing
Question
My background is in Speech/Audio enhancement and separation. I want some suggestions for using my background in Natural Language Processing topics.
Spontaneously, I do not have a concrete suggestion, but I am curious where you are getting the Náhuatl recordings from? I am now working with Sabina de la Cruz, an L1 Nahuatl speaker, who has a 10-month-old daughter. She is currently trying to establish Nahuatl as the language of use in her family so that her little daughter also acquires Nahuatl as her L1.
She has been recording the babble production of her child and I am in charge of analyzing the data.
• asked a question related to Natural Language Processing
Question
Because human thought is interconnected with language, what do you think about the integration of Natural Language Processing (NLP) with Deep Learning (DL)? I think it is the main way to build General Artificial Intelligence.
What approaches are used in the integration of NLP with DL? What are the trends in this area?
Dear Amin Honarmandi Shandiz, thank you for your contribution. It is a very interesting paper. On the other hand, the integration of vision and language processing is only one part of the way toward implementing the understanding of meaning in AI.
• asked a question related to Natural Language Processing
Question
Do you know of any NLP research on the extraction of sarcastic, metaphorical, polemical, and rhetorical phrases in texts? For example, the text "Find your patience before I lose mine."
• asked a question related to Natural Language Processing
Question
Regarding the subject of handling imbalanced data (specifically in NLP):
1. Is there any benefit to balancing nearly-balanced classes?
(say, a majority-to-minority data ratio of 60:40 or even 55:45)
2. Can such a procedure cause more harm than good?
Thank You.
Dear Raz Malka ,
When the imbalance ratio is nearly 1:1, i.e., a 55:45 or 60:40 majority-to-minority class ratio, you may not need to balance the dataset with either oversampling or undersampling. This much imbalance is trivial in real-life datasets.
But if the minority class examples are very important for correct predictions (e.g. disease datasets), and you don't want any minority class data to be left out due to an imbalanced dataset, as that may have severe effects, you can perform undersampling or oversampling to get better, more accurate results.
Regards
Sayan Surya Shaw
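For cases where balancing is warranted, the simplest oversampling scheme just re-draws minority examples at random until every class matches the majority count. A sketch of that baseline (random duplication; SMOTE-style interpolation is the common refinement, and the toy data below is illustrative):

```python
import random
from collections import Counter

def random_oversample(data, seed=0):
    """data: list of (features, label) pairs. Duplicate minority items until balanced."""
    rng = random.Random(seed)
    counts = Counter(label for _, label in data)
    target = max(counts.values())
    out = list(data)
    for label, count in counts.items():
        pool = [item for item in data if item[1] == label]
        out += rng.choices(pool, k=target - count)   # k = 0 for the majority class
    return out

balanced = random_oversample(
    [(i, "maj") for i in range(6)] + [(i, "min") for i in range(2)])
```

Note that oversampling must be applied only to the training split, after the train/test split, or duplicated examples leak into the test set.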
• asked a question related to Natural Language Processing
Question
Please show me some SCIE journals to which I can submit papers related to datasets/corpora in NLP/CL.
If you want to publish your dataset, you can submit it to Data in Brief.
• asked a question related to Natural Language Processing
Question
I am looking for postdoc opportunities in the field of Natural Language Processing and DL/ML. I did my PhD at IIT. Can anyone suggest anything?
• asked a question related to Natural Language Processing
Question
I am using Hugging Face mrm8488/longformer-base-4096-finetuned-squadv2 pre-trained model
I want to generate sentence-level embeddings. I have a data frame with a text column.
## Objective:
Create sentence/document embeddings using the **LongformerForMaskedLM** model. We don't have labels in our dataset, so we want to cluster the generated embeddings. Please let me know if the code is correct.
## Environment info
- transformers **version:3.0.2**
- Platform:
- Python version: **Python 3.6.12 :: Anaconda, Inc.**
- PyTorch version (GPU?):**1.7.1**
- Tensorflow version (GPU?): **2.3.0**
- Using GPU in script?: **Yes**
- Using distributed or parallel set-up in script?: **parallel**
## Information
I have fine-tuned LongformerForMaskedLM and saved it as a .bin file. I am trying to use this model to generate an embedding for every document (that is, one row of the pandas DataFrame).
## Code:

import torch
from transformers import LongformerTokenizer, LongformerForMaskedLM

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
# `model` is the fine-tuned LongformerForMaskedLM loaded from the saved .bin file.
# Put the model in "evaluation" mode, meaning feed-forward operation only.
model.eval()

# The **news_article** column is used to generate embeddings.
all_content = list(df['news_article'])

def sentence_bert():
    list_of_emb = []
    for i in range(len(all_content)):
        SAMPLE_TEXT = all_content[i]  # long input document
        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)
        # How to include a batch of size > 1 here?
        # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
        attention_mask = torch.ones(input_ids.shape, dtype=torch.long,
                                    device=input_ids.device)  # initialize to local attention
        attention_mask[:, [0, -1]] = 2  # global attention on <s> and </s> -- is this correct?
        # The original snippet never ran the forward pass, so `outputs` was undefined:
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask,
                            output_hidden_states=True)
        hidden_states = outputs[1]  # tuple of 13 layer outputs (embeddings + 12 layers)
        token_embeddings = torch.stack(hidden_states, dim=0)
        # Remove dimension 1, the "batches".
        token_embeddings = torch.squeeze(token_embeddings, dim=1)
        # Swap dimensions 0 and 1 -> [tokens, layers, hidden_size].
        token_embeddings = token_embeddings.permute(1, 0, 2)
        token_vecs_sum = []
        # For each token, sum the last 4 hidden layers to represent the token.
        for token in token_embeddings:
            sum_vec = torch.sum(token[-4:], dim=0)
            token_vecs_sum.append(sum_vec)
        # Sum all token vectors into a single document embedding.
        h = 0
        for i in range(len(token_vecs_sum)):
            h += token_vecs_sum[i]
        list_of_emb.append(h)
    return list_of_emb

f = sentence_bert()

## Doubts/Question:
1. The code replaces the attention-mask values of the first token <s> and the last token </s> of the document with 2. Is this the correct approach to set global attention when generating embeddings for one document?
2. The code puts the model in evaluation mode. What does model.eval() actually do?
The output is a tuple of size 2:

outputs[0] gives us the sequence output (logits):
torch.Size([1, 34, 50265])
outputs[1] gives us the hidden states: a tuple of length 13,
each of torch.Size([1, 512, 768]) -- that is, [13, 512, 768] when stacked

3. What does outputs[0], of dimension torch.Size([1, 34, 50265]), signify? 34 is my sequence length and 50265 is the vocabulary size. I understand that it is the logit output, but how should it be interpreted in plain English?
4. How can this code be corrected or changed to get proper document embeddings? Any other approach would also be helpful.
5. The code takes the last 4 hidden layers and sums them up. What could be done to instead normalize the token vectors with attention and then take the average?
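Regarding question 5, one common alternative to summing layers is attention-masked mean pooling. The sketch below is a self-contained illustration with random tensors (the variable names are not from the snippet above); it averages token vectors while ignoring padding positions:

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, ignoring padding positions.
    last_hidden_state: [batch, seq_len, hidden]; attention_mask: [batch, seq_len]."""
    # Treat any non-zero mask value (local=1 or global=2) as "attend".
    mask = (attention_mask > 0).unsqueeze(-1).float()  # [batch, seq_len, 1]
    summed = (last_hidden_state * mask).sum(dim=1)     # [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1e-9)           # [batch, 1]
    return summed / counts                             # [batch, hidden]

# toy check with random tensors
hidden = torch.randn(1, 6, 8)
mask = torch.tensor([[2, 1, 1, 1, 0, 0]])  # last two positions are padding
doc_vec = masked_mean_pool(hidden, mask)
```

Applied to the code above, `last_hidden_state` would be the last element of the hidden-states tuple, and `attention_mask` the same mask passed to the model.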
## Expected behavior
Document1: Embeddings
Document2: Embeddings
I have an idea: use extractive summarization on the context/passage before feeding it into the reader. I illustrate this in the paper:
• asked a question related to Natural Language Processing
Question
Hi All! I am looking for a public dataset that includes a lot of text (such as diary entries) along with psychological outcomes such as well-being, cognitive ability, or mental health. Does anyone know of any datasets that would be appropriate?
Thanks so much!
-Elliott
Hi Elliot,
It could be a valid starting point to connect with the dataset owners.
Regards,
Mauro
• asked a question related to Natural Language Processing
Question
BERT has token length limits with respect to handling text.
And I want to classify each large paragraph of text as either appropriate or inappropriate for kids.
Splitting the paragraphs (large texts) into smaller chunks is not helpful, because the ground-truth label used for training applies to the entire text, not to individual chunks, for which it may differ.
If we split paragraphs into sentences and classify those, the training may be polluted: in many texts only a small part of a large paragraph is inappropriate, yet all the remaining chunks would also be trained as inappropriate.
Therefore, I need to process the full paragraph in one go.
#NLP #ML
Hi.
Firstly, I'd suggest processing the input text with a classical LSTM-based neural network. It may help avoid the problem of fixed-length inputs. For instance, you can represent the words of a text in vector form using a semantic embedding model (e.g., ELMo), then pass this sequence through an LSTM cell; the output is a vector that can be processed by a simple binary classifier (a few feedforward layers).
However, I'm not sure that analyzing the sequence of words will be effective for your task. Instead, I propose considering the text at the level of sentences. It may help to reveal "bad" text spans while analyzing the whole text. That's why I'd suggest the following algorithm:
1. Split an input text into a set of sentences.
2. Represent each sentence as a set of vectors using a semantic embedding model.
3. Pass each sentence through a "Sentence model" that consists of LSTM cells. It may help to represent each sentence as a vector.
4. Pass obtained sentence vectors through an additional LSTM layer. Then the output vector is processed by a binary classifier (dense layers) providing the probability of the appropriateness of a text.
My research touches on similar problems (binary classification of a whole document) so maybe I will be able to provide you with some close solutions.
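The hierarchy described above can be sketched in PyTorch roughly as follows (all layer sizes and names here are illustrative assumptions; in practice, the word vectors in step 2 would come from a real embedding model such as ELMo):

```python
import torch
import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    """Sentence-level LSTM -> document-level LSTM -> binary classifier."""
    def __init__(self, emb_dim=32, sent_dim=16, doc_dim=16):
        super().__init__()
        self.sent_lstm = nn.LSTM(emb_dim, sent_dim, batch_first=True)
        self.doc_lstm = nn.LSTM(sent_dim, doc_dim, batch_first=True)
        self.classifier = nn.Sequential(nn.Linear(doc_dim, 1), nn.Sigmoid())

    def forward(self, sentences):
        # sentences: list of [num_words, emb_dim] tensors, one per sentence
        sent_vecs = []
        for words in sentences:
            _, (h_n, _) = self.sent_lstm(words.unsqueeze(0))  # add batch dim
            sent_vecs.append(h_n[-1])                         # [1, sent_dim]
        doc_input = torch.stack(sent_vecs, dim=1)             # [1, n_sents, sent_dim]
        _, (h_n, _) = self.doc_lstm(doc_input)
        return self.classifier(h_n[-1])                       # probability in (0, 1)

model = HierarchicalClassifier()
doc = [torch.randn(5, 32), torch.randn(7, 32)]  # two sentences of word embeddings
prob = model(doc)
```

This is only a skeleton: batching, padding, and training loop are omitted, and the final probability would be thresholded to decide appropriateness.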
Best regards,
Artem Kramov
• asked a question related to Natural Language Processing
Question
I want to use a CNN in NLP for sentiment analysis with two different input networks, one with an LSTM and another with a CNN. But to concatenate both outputs, do I need to fix the kernel size of the CNN to 1?
Neural networks are a set of algorithms designed to recognize patterns. These patterns are numbers contained in vectors, translated from real-world data such as images, sound, text, or time series. A convolutional neural network is a neural network that applies convolutional layers to local features.
Regards,
Shafagat
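On the original question: the kernel size does not have to be 1. If each branch is reduced to a fixed-size vector (e.g., global max pooling over time for the CNN, last hidden state for the LSTM), the two outputs can be concatenated regardless of kernel size. A hedged PyTorch sketch with hypothetical layer sizes:

```python
import torch
import torch.nn as nn

class TwoBranchSentiment(nn.Module):
    """Concatenate an LSTM branch and a CNN branch over the same embedded text."""
    def __init__(self, emb_dim=32, lstm_dim=16, n_filters=16, kernel_size=3):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, lstm_dim, batch_first=True)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size)
        self.out = nn.Linear(lstm_dim + n_filters, 2)  # 2 sentiment classes

    def forward(self, x):                    # x: [batch, seq_len, emb_dim]
        _, (h_n, _) = self.lstm(x)
        lstm_vec = h_n[-1]                   # [batch, lstm_dim]
        conv_out = torch.relu(self.conv(x.transpose(1, 2)))  # [batch, filters, seq']
        cnn_vec = conv_out.max(dim=2).values  # global max pool -> [batch, filters]
        return self.out(torch.cat([lstm_vec, cnn_vec], dim=1))

model = TwoBranchSentiment(kernel_size=3)    # any kernel size works here
logits = model(torch.randn(4, 10, 32))       # batch of 4 embedded sequences
```

The global pooling step is what makes the concatenation dimension-independent of both kernel size and sequence length.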
• asked a question related to Natural Language Processing
Question
I am trying to develop an architecture for a Customer-Agent smart-reply system for a chatbot. I am taking help from Google's paper published in 2016:
That paper addresses smart reply in the Gmail use case. Our objective is to do it for the chatbot scenario, where agent and customer utterances alternate.
The problem that I am facing is:
How should I structure/feed my input (i.e., agent-customer utterances) to the downstream model to get the top 3 or 5 responses? How should the model understand which are agent utterances and which are customer utterances?
Input looks like:
Agent: Welcome to XYZ. How can I assist you?
Customer: I am facing an issue with the internet.
Agent: May I know your registered mobile number?
Customer: My mobile number is XXXX. My email ID is XXX@yyy.com.
Customer: Please resolve my issue asap.
The customer or agent can have more than one utterance in a session, and one utterance can contain more than one sentence.
How should the input be fed to the model to get the desired smart replies?
Like in the above example it can be:
Agent Smart Reply 1: We are working on it Mr.XXX
Agent Smart Reply 2: Your issue is resolved. Thanks for contacting XYZ
Agent Smart Reply 3: Your issue will take some time to get resolved. Please wait for 24 -48 hrs.
How can we feed the Customer-Agent utterances to the model? Any paper with reproducible code or any new suggestion would be helpful.
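For illustration, one possible way to serialize such a session (the speaker tokens `<agent>`/`<customer>` and the `<eou>` end-of-utterance separator are my assumptions, not from the Google paper) is to prefix each utterance with its speaker and join the turns in order, so the model can tell the speakers apart:

```python
def serialize_session(turns, sep=" <eou> "):
    """turns: list of (speaker, utterance) pairs in chronological order.
    Prefixes each utterance with a speaker token such as <agent>/<customer>."""
    return sep.join(f"<{speaker}> {utterance}" for speaker, utterance in turns)

session = [
    ("agent", "Welcome to XYZ. How can I assist you?"),
    ("customer", "I am facing an issue with the internet."),
    ("agent", "May I know your registered mobile number?"),
]
context = serialize_session(session)
# The model would then be trained to predict the next agent turn from `context`.
```

With this framing, training pairs are (serialized context so far, next agent utterance), and the top 3-5 responses come from ranking candidate replies against the context.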
These codes might be useful, have a look:
Kindly let me know if it is helpful.
Kind Regards
Qamar Ul Islam
• asked a question related to Natural Language Processing
Question
Is there an automated technique to convert text into a graphical (structured) form so that it can be added to a knowledge base? WikiData, Inquire, and other systems have built their knowledge bases manually.
Well, I don't think there is "the" answer. The task you mention spans whole areas within NLP: named entity extraction, relation extraction, coreference resolution, etc. None of them works as well as we would like. There are dozens and dozens of systems for each of these tasks, and many more ways of putting them together.
• asked a question related to Natural Language Processing
Question
I am trying to build a voice cloning model. Is there some scripted text I should use for this purpose, or can I speak anything at random?
What should the length of the audio be, and are there any model suggestions that are fast or accurate?
Text-to-speech synthesis is a problem with applications in a wide range of scenarios: reading PDFs aloud, helping the visually impaired interact with text, making chatbots more interactive, etc. Historically, many systems were built to tackle this task using signal processing and deep learning approaches. In this article, let's explore a novel approach to synthesizing speech from text presented by Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, and Yonghui Wu, researchers at Google, in a paper published on 2 January 2019.
Regards,
Shafagat
• asked a question related to Natural Language Processing
Question
The WuDao 2.0 model from China is a neural network (NN) with 1.75 trillion parameters, topping the 1.6 trillion of a similar model Google unveiled in January 2021. Is this the start of a race to quadrillion-parameter NNs? Do you have additional information about the structure and design of such ultra-large NNs?
This reminds me of the Japanese Fifth Generation program. Working bigger and harder with the same methods will probably not lead to major leaps.
• asked a question related to Natural Language Processing
Question
Which journals accept survey papers in the field of Machine Translation or NLP?
• Discrete Mathematics and Theoretical Computer Science.
• Journal of ICT Research and Applications.
• Journal of the Brazilian Computer Society.
• Interdisciplinary Journal of Information, Knowledge, and Management.
• Journal of Computing and Information Technology.
• International Journal of Computer Science in Sport.
• asked a question related to Natural Language Processing
Question
I am working on a machine learning research project that uses Twitter text data. I haven't found books or articles that support my research project in natural language processing (text data).
How can I extend my research project to Explainable AI in Natural Language Processing?
What methods can I use?
Can you recommend articles and books on XAI research?
In the last month I did a literature search on XAI (explainable AI), and I can recommend this XAI overview paper:
Moreover, I can recommend the online book Interpretable Machine Learning - A Guide for Making Black Box Models Explainable: https://christophm.github.io/interpretable-ml-book/
Natural Language Processing is not my research field, so I cannot recommend specific XAI methods for your research, but I hope the literature listed above helps you.
• asked a question related to Natural Language Processing
Question
I could not find out how to add semantics to my questions about the Al-Quran.
• asked a question related to Natural Language Processing
Question
Hello!
Currently I am trying to find datasets for document-level translation, rather than just sentence-to-sentence translation datasets.
Any suggestions?
I use this tool https://www.systransoft.com/ and I have started trying https://www.deepl.com/translator, which Wolfgang R. recommended and which works great; check them out and compare. Machine translation is still not 100% precise.
• asked a question related to Natural Language Processing
Question
I have started my PhD work in the field of NLP, and I want to know the best ISI journals for publishing NLP papers.
• asked a question related to Natural Language Processing
Question
Is IPOPT able to solve a distributed optimization problem to find a Nash equilibrium?
\forall i \in N:
min_{u_i} J^i(s, u_i, u_{-i})
s.t. h^i(s, u_i, u_{-i}) \leq 0
where J^i and u_i are the cost function and input of agent i (i = 1, ..., N), s is the shared state, u_{-i} are the other agents' inputs, and h^i are possible inequality constraints.
Please share an efficient algorithm for solving this distributed NLP (nonlinear program). (I want to use it in MPC.)
• asked a question related to Natural Language Processing
Question
Dears,
I am trying to compile a selected list of (free) materials for learning the various aspects of AI in the broad sense, ranging from ML to NLP. Your help is very welcome. Did you find any particular course/resource useful? Please drop the link in the discussion and share why.
If you are curious, here you can find the current list:
Why do you state that "Deep Learning can't help you"? Have you ever tried to use it? I do not have any experience with it, which is why I am asking about it!
• asked a question related to Natural Language Processing
Question
I have a text data which has two columns:
81354 Friends
53014 Gravitation and spacetime
3067 Mapping desire, geographies of sexualities
941 British civilization, an introduction
......
The first column is the label, which can be up to 27 digits long (but most have 3 or 4 digits) and belongs to one of the classes 0-9 (by starting digit). As you can see, we have thousands of classes (when considering the subsequent digits).
My first problem is that this data is imbalanced: entries starting with the digit 3 are far more numerous than those of the other classes (see the first image, ddc_group_counts.png).
I've been looking at oversampling methods, but they are mostly for numeric data. Unfortunately, I cannot convert my data, since I later need to feed the text into a neural network (BERT, by the Google research team).
So are there any methods I can use to generate some more text data for the minor classes?
Can you refer me to any paper which have done something similar or can help me?
The other point is that when I inspect the data inside each of those 9 groups (by considering only the first two digits of their labels, like 00, 01, 03, ..., 20, 21, 23, and so on), it again shows a highly imbalanced structure inside each of the 9 classes (see the two other images).
Are there methods to make the distribution of such text data more uniform?
Can you refer me to any papers on this?
thanks
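As a starting point for generating more text for the minority classes, here is a minimal augmentation sketch in the spirit of "Easy Data Augmentation" (Wei & Zou): random word deletion and random swap. The function name and parameters are illustrative, not from any specific library:

```python
import random

def augment(text, n_aug=3, p_delete=0.1, seed=0):
    """Generate n_aug noisy variants of `text` via random deletion and swap."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_aug):
        words = text.split()
        # random deletion: drop each word with probability p_delete
        kept = [w for w in words if rng.random() > p_delete] or words[:1]
        # random swap: exchange two positions
        if len(kept) > 1:
            i, j = rng.sample(range(len(kept)), 2)
            kept[i], kept[j] = kept[j], kept[i]
        variants.append(" ".join(kept))
    return variants

extra = augment("Mapping desire, geographies of sexualities")
```

Synonym replacement (e.g., via WordNet) and back-translation are stronger variants of the same idea; all of them keep the data as text, so it can still be fed to BERT.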
• asked a question related to Natural Language Processing
Question
I have a large-scale MINLP optimization problem, implemented in GAMS. To solve it, I decomposed the problem into two sequential stages: MILP and NLP. For the MILP stage I am using CPLEX, and for the NLP stage, which is a nonconvex optimization problem, I am using BARON. The MILP stage has about 8000 variables and the NLP stage has 2000 variables.
Everything works perfectly so far, but now I need to add a neural network on top of my optimization problems to determine some of the parameters. For this aim, I can use either MATLAB or Python.
Now my question is: does Pyomo have enough capability and solvers to handle both large-scale MILP and nonconvex NLP problems, so that I can work only in Python for both the optimization and the neural network? Or would it be better to keep the optimization in GAMS, implement the neural network in MATLAB or Python, and link it to GAMS?
I agree with professor Muhammad Ali .
MATLAB is the best option, and if you want a free alternative, you could use Python.
• asked a question related to Natural Language Processing
Question
If a dataset contains a lot of punctuation, what effect will it have on the training phase when using deep learning techniques?
Sumit Singh Chauhan Unfortunately, I'm not an expert in chatbot development. However, according to this paper () the punctuation needs to be accounted for. This paper is open access.
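If you do decide to strip punctuation during preprocessing (whether that helps is task-dependent, as discussed above), a minimal standard-library sketch:

```python
import string

def strip_punctuation(text):
    """Remove ASCII punctuation characters from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))

cleaned = strip_punctuation("Hello!!! How are you, today?")
# → "Hello How are you today"
```

For dialogue data, question marks and exclamation points often carry signal, so many pipelines keep punctuation as separate tokens instead of deleting it.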
• asked a question related to Natural Language Processing
Question
I am working on a project to identify sensitive information in text, and I am looking into the categories below:
-health
-politics
-crime
Can someone suggest an open-source tool for this?