Science topic

Web Scraping - Science topic

Explore the latest questions and answers in Web Scraping, and find Web Scraping experts.
Questions related to Web Scraping
  • asked a question related to Web Scraping
Question
4 answers
I am interested in analyzing some features of editor and reviewer comments during the peer review process. There doesn't appear to be an option to limit search results to articles that include reviewer reports in Scopus/Google Scholar/Web of Science, or even PLoS ONE. Does anybody know a workaround that doesn't involve web scraping?
(I know some journals like BMC publish reviewer reports as a standard practice, but I would like to search multiple journals that offer, but do not require, open peer review.)
Thanks!
Relevant answer
Answer
I think all EGU journals, such as Biogeosciences, allow you to download reviewer comments and responses, but I am not aware of any way to automate that search.
  • asked a question related to Web Scraping
Question
8 answers
Hello,
Recently I have been trying to scrape Yahoo Finance financial tables for my Ph.D. study. I am using R and the rvest package, but I realized I cannot scrape all the data in the tables because of the "expand all" and "collapse all" buttons. Then I found out it is possible to click those buttons with RSelenium.
The button is indicated as "<div class="expandPf Fz(s)...."
My code is:
```
webElem <- remDr$findElement(using = "class", ".expandPf")
```
but I am getting an error. I tried it with "CSS" as well, but I get the same error message.
Does anyone know how to fix it?
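The error is most likely the locator strategy: WebDriver-based tools accept strategies named "class name" or "css selector" (not "class"), and a class-name locator takes the bare class ("expandPf") while a CSS selector keeps the dot ("div.expandPf"); in RSelenium that would be findElement(using = "css selector", value = "div.expandPf") followed by webElem$clickElement(). As an illustration only, here is a hedged Python Selenium sketch of the same click (the URL is a placeholder and the class name is taken from the snippet above, so both may have changed):
```
# Hedged sketch: click an "expand all" button by CSS class, then read the page.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(10)                         # wait for elements to appear
driver.get("https://finance.yahoo.com/quote/AAPL/financials")  # placeholder URL
driver.find_element(By.CSS_SELECTOR, "div.expandPf").click()   # dot only in CSS
html = driver.page_source                          # now contains expanded rows
driver.quit()
```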
  • asked a question related to Web Scraping
Question
10 answers
I am currently doing a web scraping study about the online prices of laptops. My objectives are
  1. to test if there is a significant difference in online prices of different laptop brands on different days of the month;
  2. to determine if there is a significant difference in online prices of different laptop ranges on different days of the month; and
  3. to test if there is a significant difference in online prices of laptop ranges per brand on different days of the month
I finished scraping the online prices of laptops last week. I am planning to use two-way and three-way ANOVA to answer the objectives, but as I was checking the assumptions of ANOVA, I noticed that the homogeneity of variance is violated. I used Levene's test to check the homogeneity of variance. Is it still okay to proceed with two-way and three-way ANOVA?
Relevant answer
Answer
As Bruce said: the ANOVA is quite robust given the sample sizes are similar.
I think the more relevant question here is whether the price difference is really a useful measure, or whether the price ratio would be the more sensible statistic to analyze. Price changes are usually proportional; they are even usually given in percent. Hence, on the absolute scale, the variation in prices of expensive laptops is expected to be larger than that of cheaper models. This all becomes relevant when you include laptops with rather different price tags, and it might be negligible when the laptops all cost about the same (but even then I would consider it smarter to analyze relative changes, for purely theoretical reasons and from my understanding of pricing policies and human behaviour: 50€ more on a high-end laptop that costs 4000€ is usually considered irrelevant, whereas the same 50€ more for a laptop that is sold for just about 300€ seems unacceptable).
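If relative changes are the more meaningful scale, one common workaround is to run the ANOVA on log-transformed prices, which turns proportional differences into additive ones and often eases the variance-homogeneity problem. A hedged statsmodels sketch, assuming a data frame with columns named price, brand, and day (adjust to the actual column names):
```
# Hedged sketch: two-way ANOVA on log(price); file and column names are assumptions.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("laptop_prices.csv")                 # placeholder file name
model = smf.ols("np.log(price) ~ C(brand) * C(day)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))                # Type II ANOVA table
```
Levene's test can then be re-run on the log scale to see whether the violation persists.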
  • asked a question related to Web Scraping
Question
5 answers
Dear community, in order to describe a website's architecture design (pages, order of pages, titles, ...), I tried to apply web scraping, but so far I can't find a tutorial on how to get this type of data as mentioned below.
Relevant answer
Answer
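One hedged way to collect this kind of data (page URLs, their titles, and the order in which pages are reached) is a small breadth-first crawler that builds a simple site map. In the sketch below the start URL, the page limit, and the recorded fields are assumptions, and robots.txt and the site's terms of use should be checked first:
```
# Hedged sketch: crawl internal links and record a simple site map.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START = "https://example.com/"     # placeholder start page
MAX_PAGES = 50                     # crawl limit (assumption)

seen, queue, site_map = set(), [START], []
while queue and len(seen) < MAX_PAGES:
    url = queue.pop(0)
    if url in seen:
        continue
    seen.add(url)
    resp = requests.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
    internal = [l for l in links if urlparse(l).netloc == urlparse(START).netloc]
    site_map.append({"order": len(site_map) + 1, "url": url, "title": title})
    queue.extend(internal)

for page in site_map:
    print(page["order"], page["url"], page["title"])
```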
  • asked a question related to Web Scraping
Question
3 answers
I want to dive deeper into data analytics. I have done quite a few basic projects, e.g. getting a random dataset, cleaning/analyzing it, and then making visualizations. However, I want to conduct a more challenging project, e.g. some web scraping, maybe using SQL and Python to analyze data and visualize it, to help solve some potential real-world scenarios. But as a typical computer science student, I'm not very creative and am struggling to think of valid ideas. One idea I had is to gather Uber driving data at my university, store it in a SQL database, do some cleaning and analysis, and try to visualize the busier spots, etc.
I would much appreciate ideas from those established in this field! Thank you.
Relevant answer
Answer
It is good that you already have an idea, which is to identify busy spots using Uber data. But what next? What problems are you going to solve, at what scale (national or regional), and for whom (audience)? To answer this, conduct a scoping review or narrative review of the field or topic of interest to identify the research gap or generate a hypothesis.
  • asked a question related to Web Scraping
Question
5 answers
Dear community, in order to analyze a competitor's website and its features, I want to extract some data from the website. It is the first time I have heard about web scraping; can you please guide me on how to do it using Python for competitive intelligence?
Thank you
Relevant answer
Answer
I suggest that you try as many tutorials as you can find, since web scraping can be a bit tricky at times. What makes it even more difficult is that websites vary so much in terms of layout. Basic knowledge of markup languages like HTML and XML, however, can be extremely helpful. When it comes to competitive intelligence, it is no different; you just need to select the website and write your script accordingly. You should also check the website's policies to make sure scraping is allowed. This can be done by appending "/robots.txt" to the site's base URL. Please see below if it helps. Thanks
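As a small illustration of the robots.txt check mentioned above, Python's standard library can parse the file and tell you whether a given path may be fetched (the URLs below are placeholders):
```
# Hedged sketch: check robots.txt before scraping; URLs are placeholders.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://example.com/products/page1.html"))  # True / False
```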
  • asked a question related to Web Scraping
Question
2 answers
Illegal wildlife trade (IWT) in cybermarkets is a growing issue around the world. Scientists, therefore, are adapting their strategies to a new paradigm for biodiversity conservation at internet speed: we need common analytical methods, to sensitize stakeholders, and to join efforts to maximize our actions.
I search for potential illegalities in the trade of Brazilian species on the internet and mostly use Web Scraper to sample pages and their content, which I train my "digital robots" to capture.
My main goals are to find, monitor, and expose these marketplaces in order to neutralize advertisements and sensitize public opinion to the risks of digital biopiracy (mostly IWT over the internet).
If you know any other method for capturing information on the internet, please, let me know.
Thank you for your help.
Relevant answer
Answer
The topic you are researching is very interesting, but unfortunately I cannot add anything on the aspect you asked about.
Keep up the good work!
Bests!
  • asked a question related to Web Scraping
Question
4 answers
Hi all,
Using requests in Python we can get JSON files from different webpages. However, in some cases, due to Cloudflare or certain cookies, we cannot get the JSON files. On the other hand, some browser extensions can get any JSON from a webpage. Can we connect a Python script to a browser extension to get the JSON file?
If yes, how?
Thanks in advance
Vahid
Relevant answer
Answer
I use "Network" tab of Inspect element to find the json url. In most of cases, using requests.get () works correctly, but in some pages Cloud flare or csrf-token cookies prevent from my connections. Do you know how we can pass each of these cases?
Thanks
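For the csrf-token case, a partial workaround is often to replay the exact headers and cookies the browser sends (copied from the same Network tab) in a requests.Session; a hedged sketch, with every header and cookie value a placeholder. Cloudflare's JavaScript challenges, however, frequently require real browser automation (e.g. Selenium) rather than plain requests:
```
# Hedged sketch: reuse browser headers/cookies in a session. All values below
# are placeholders copied from the browser's Network tab.
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (copied from the browser)",
    "Referer": "https://example.com/page",
    "X-CSRF-Token": "<token seen in the browser request>",
})
session.cookies.set("cf_clearance", "<cookie value from the browser>", domain="example.com")

resp = session.get("https://example.com/api/data.json", timeout=30)
resp.raise_for_status()
print(resp.json())
```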
  • asked a question related to Web Scraping
Question
10 answers
I need to collect a large amount of data using web scraping. What software is most suited for this? It is a one-off project, so I am after a software package that is quick to learn and easy to use. Any suggestions?
Relevant answer
Answer
That depends very much on your work, but I prefer Python programming. Point-and-click software rarely gives the best results and always has some limitation. In Python you can use a lot of packages for this, such as BeautifulSoup, Selenium, and Scrapy.
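As a starting point with those packages, a minimal requests + BeautifulSoup sketch (the URL and CSS selector are placeholders to adapt to the target site):
```
# Hedged sketch: fetch a page and list its links; URL and selector are placeholders.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/listing",
                    headers={"User-Agent": "research-scraper/0.1"}, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
for a in soup.select("h2 a"):                      # adjust the selector per site
    print(a.get_text(strip=True), a.get("href"))
```
Scrapy becomes worthwhile once you need crawling across many pages, retries, and export pipelines; Selenium is useful when the content only appears after JavaScript runs.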
  • asked a question related to Web Scraping
Question
3 answers
The VirusTotal website is a tool for detecting malware in Android apps. It also gives various information related to the uploaded APK, like manifest tags, permissions, etc. I want to extract this information for multiple APKs. How can I scrape and extract this data for multiple APKs?
Relevant answer
Answer
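One hedged option is to skip HTML scraping and use VirusTotal's public REST API (v3), which returns a JSON report per file hash. In the sketch below, the API key, the APK hashes, and the Android-specific field names (the androguard block and permission_details) are assumptions to verify against the current API documentation and your rate limit:
```
# Hedged sketch: query VirusTotal API v3 file reports for several APK hashes.
import requests

API_KEY = "<your VirusTotal API key>"
HASHES = ["<sha256 of apk 1>", "<sha256 of apk 2>"]

for h in HASHES:
    r = requests.get(f"https://www.virustotal.com/api/v3/files/{h}",
                     headers={"x-apikey": API_KEY}, timeout=30)
    r.raise_for_status()
    attrs = r.json()["data"]["attributes"]
    android = attrs.get("androguard", {})   # Android details, if present; name is an assumption
    print(h, sorted(android.get("permission_details", {})))
```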
  • asked a question related to Web Scraping
Question
4 answers
Hi,
I'm writing code to scrape email addresses using Scrapy to find prospects for SEO link building.
I'm facing difficulties filtering out the unwanted outputs that I'm getting.
Presently, I see two possibilities for proceeding:
1) Manipulate the text by modifying the pattern matching with regular expressions, using the re.findall() function
2) Reject the unwanted text from the email addresses using a 'reject' method.
Can anyone help me with the code?
Best,
- Aashay Raut
Relevant answer
Answer
Anton Vrdoljak I'm already looking at the same. Thanks.
Mohammed Deeb I have already checked that code and I'm trying to adapt it to my requirements. To validate the email addresses, various filters need to be added.
And I need help with the regex.
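A hedged sketch of the two options combined: one pattern to find candidate addresses with re.findall() and a second "reject" pattern for the unwanted matches (both patterns, and the example rejects such as image filenames, are assumptions to tune per site):
```
# Hedged sketch: extract emails, then reject common false positives.
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
REJECT_RE = re.compile(r"\.(png|jpe?g|gif|svg|webp)$|@example\.", re.IGNORECASE)

def extract_emails(text):
    """Return candidate emails that do not match the reject pattern."""
    return [m for m in EMAIL_RE.findall(text) if not REJECT_RE.search(m)]

print(extract_emails("Write to editor@somesite.org or see logo@2x.png"))
```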
  • asked a question related to Web Scraping
Question
7 answers
For the most part, I have found papers by following researchers on Twitter and through blog posts, but I was wondering how other people keep informed. What kinds of methods have helped you? Apps? Reading lists? A daily schedule? Web scraping?
Relevant answer
Answer
What are some strategies for keeping up with the latest literature in a field?
My strategy is to collect various sources of literature, including journal articles, textbooks, theses, Google Scholar search results, RG links, etc., and to log what I have collected and reviewed in an Excel sheet, as per the RG link below. The advantages of the Excel log are:
  1. it helps you keep track of which literature artifacts you have collected and reviewed (without it you might not recall what you have collected, fully reviewed, or only half-reviewed);
  2. it lets you summarize the content of each piece of literature you have reviewed, e.g. the addressed issue, problem statement, research methodology, findings, etc. (this pushes you to understand the highlights / key points you have obtained from the artifacts reviewed);
  3. it supports your later-stage analysis and synthesis, when you need to arrange and write the literature review section of your article / thesis.
  • asked a question related to Web Scraping
Question
6 answers
I will need to scrape Instagram for public posts related to a specific hashtag as data for a content and visual analysis that is part of my project.
Now, as I understand it, scraping data for academic purposes is legal (and ethical if done right) here in Norway and in the US (where Instagram is based). However, Instagram's TOS states: "You can't attempt to create accounts or access or collect information in unauthorized ways. This includes creating accounts or collecting information in an automated way without our express permission."
My question is: Is it still possible for me as a researcher to scrape Instagram, or does this TOS-point weigh too heavily against it?
Relevant answer
Answer
Your best option is likely to contact Instagram and ask them. The plain reading of the ToS is that this is not allowed. I'm not a lawyer, but I think the GDPR also causes problems which effectively mean you may be restricted from scraping data on EU citizens.
  • asked a question related to Web Scraping
Question
5 answers
I am new to Python and am making a web scraping program to automatically collect information about scholarships for me, and I want to share it with my friends.
It has a few dependencies.
It runs perfectly in PyCharm and on the command line.
But when I try to convert it to an .exe using PyInstaller, it just blinks a black screen. I put it in the same folder that has the dependencies; that doesn't work either.
Relevant answer
Answer
Try this:
You can use PyInstaller. First, install it from the command prompt by executing
pip install pyinstaller
Then, in the command prompt, enter the command
pyinstaller filename.py
replacing filename with your Python code file.
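A blinking black window usually means the bundled script raises an exception (often a missing module or data file) and the console closes before the traceback can be read. One hedged workaround is to keep the window open from inside the script and then rebuild; the main() function below is a placeholder for your own code:
```
# Hedged sketch: keep the console open so the PyInstaller .exe shows its error.
import traceback

def main():
    ...   # your scraping code goes here

if __name__ == "__main__":
    try:
        main()
    except Exception:
        traceback.print_exc()          # show the real error instead of a blink
    input("Press Enter to close...")
```
Rebuilding with pyinstaller --onefile filename.py and launching the resulting .exe from an already-open command prompt also keeps the output visible.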
  • asked a question related to Web Scraping
Question
2 answers
Dear colleagues,
I need to scrape as much as I can for my PhD research. My research area is health communication; it investigates the role of mediated communication in public health, specifically focusing on the anti-vaccine issue as a comparative study of vaccination messages in KSA and AUS. I will focus on one media platform: Twitter or Facebook.
I have used a scraper tool to collect data from Twitter. I started with some hashtags for KSA and added a few more hashtags I found. I noticed that some hashtags are used for spam; I tried to clean the data of spam as much as I can, but I may still find some spam tweets.
At the same time, I have found bad news: Facebook and Instagram are banning anti-vaccination content, and it seems like Twitter is starting to do the same. A lot of the hashtags I'm trying in English give very few and poor results, even when I'm not focusing on KSA or Australia, as you can see in this link:
As a result, I am facing two problems: How can I determine the country when scraping data? And how can I translate data from Arabic to English for analysis, since I will use Leximancer and it does not work with Arabic content?
I need to collect as much data NOW as I can, so do you have any helpful advice on this, please?
Many Thanks
Relevant answer
Answer
Health communication is a very broad area with many peer-reviewed papers.
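On the country question, the standard Twitter search API accepts a geocode="lat,long,radius" restriction, and each tweet carries a free-text user.location field that can serve as a second, noisy signal. A hedged tweepy 3.x-style sketch; the credentials, hashtag, and the rough centre/radius values for KSA and Australia are assumptions, and results are limited to tweets Twitter can locate:
```
# Hedged sketch (tweepy 3.x-style): restrict a hashtag search by geocode.
import tweepy

auth = tweepy.OAuthHandler("<consumer_key>", "<consumer_secret>")
auth.set_access_token("<access_token>", "<access_secret>")
api = tweepy.API(auth, wait_on_rate_limit=True)

regions = {"KSA": "24.7,45.1,1000km", "AUS": "-25.3,133.8,2000km"}  # approximate

for country, geocode in regions.items():
    cursor = tweepy.Cursor(api.search, q="#vaccine -filter:retweets",
                           geocode=geocode, tweet_mode="extended", count=100)
    for tweet in cursor.items(200):
        # user.location is free text; keep it for a later manual country check
        print(country, "|", tweet.user.location, "|", tweet.full_text[:80])
```
For the Arabic-to-English step, a translation service can be applied to the saved tweet text before loading it into Leximancer; keeping the original Arabic alongside the translation helps check what the translation loses.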
  • asked a question related to Web Scraping
Question
20 answers
Many websites say in their terms that use of anything (text, pictures, etc) on their site is prohibited because it is their intellectual property. Does anyone know if it is actually illegal or legal to web scrape data from websites to use in research? Do I need to get permission from each individual website I want to scrape? Does the data need to be "anonymous" when published (i.e. someone can't determine which website it came from)?
Relevant answer
Answer
It is necessary to check both robots.txt and the website's publication rules. For instance, at https://ssarherps.org/cndb/#c25ha2UmbG9vc2U9dHJ1ZQ there are a lot of reptile images publicly available; however, it is prohibited to use them for commercial or research purposes without direct permission (https://calphotos.berkeley.edu/use.html).
Best regards,
Reza Sadeghi
  • asked a question related to Web Scraping
Question
7 answers
I want to extract data from the internet for specific medical terms. Which tools are available with the best accuracy or output?
Relevant answer
Answer
The term data harvesting, or web scraping, has always been a concern for website operators and data publishers. Data harvesting is a process where a small script, also known as a malicious bot, is used to automatically extract large amounts of data from websites and use it for other purposes. As a cheap and easy way to collect online data, the technique is often used without permission to steal website information such as text, photos, email addresses, and contact lists.
One method of data harvesting targets databases in particular. The script finds a way to cycle through the records of a database and then downloads each and every record.
Aside from the obvious consequence of data loss, data harvesting can also be detrimental to businesses in other ways:
  • Poor SEO Ranking: If your website content is scraped, reproduced and used on other sites, this will significantly affect the SEO ranking and performance for your website on search engines.
  • Decreased Website Speed: When used repeatedly, scraping attacks can lower the performance of your websites and affect the user experience.
  • Lost Market Advantages: Your competitors may use data harvesting to scrape valuable information such as customer lists to gather intelligence about your business.
If you are using this for research, I think you should concentrate on techniques rather than tools.
There are also a lot of other data mining techniques, but these seven are considered the most frequently used:
  • Statistics
  • Clustering
  • Visualization
  • Decision Tree
  • Association Rules
  • Neural Networks
  • Classification
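For the original question (pulling web data for specific medical terms), sources with an explicit API are usually more accurate than scraping rendered pages; PubMed, for example, exposes the NCBI E-utilities endpoints. A hedged sketch, with the search term and result count as placeholders:
```
# Hedged sketch: search PubMed for a medical term and fetch the abstracts.
import requests

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
term = "myocardial infarction"                       # placeholder medical term

ids = requests.get(f"{BASE}/esearch.fcgi",
                   params={"db": "pubmed", "term": term,
                           "retmax": 20, "retmode": "json"},
                   timeout=30).json()["esearchresult"]["idlist"]

abstracts = requests.get(f"{BASE}/efetch.fcgi",
                         params={"db": "pubmed", "id": ",".join(ids),
                                 "rettype": "abstract", "retmode": "text"},
                         timeout=30).text
print(abstracts[:500])
```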
  • asked a question related to Web Scraping
Question
3 answers
Hello Everyone,
I need some help using the BeautifulSoup library for web scraping.
My goal is to extract the text exactly as it appears on the webpage, for which I am extracting all the "p" tags and their text; but inside the "p" tags there are "a" tags which also contain some text.
So my questions are:
1. How do I convert the Unicode-escaped output into normal strings, as the text appears on the webpage? When I only extract the "p" tags, the BeautifulSoup library returns the text with Unicode escapes, and even the special characters are escaped, so I want to convert the extracted text into normal text. How can I do that?
2. How do I extract the text inside "p" tags which have "a" tags in them? I mean, I would like to extract the complete text inside the "p" tags, including the text inside nested tags.
I have tried the following code:
news_soup = BeautifulSoup(html, "html.parser")
a_text = news_soup.find_all('p')
y = a_text[1].find_all('a').string
Relevant answer
Answer
from bs4 import BeautifulSoup
import requests
import re

html = requests.get("https://example.com/news").content  # placeholder URL: raw bytes of the page

# 1) Re-encode: decode the bytes as UTF-8, then drop any non-ASCII characters
unicode_str = html.decode("utf8")
encoded_str = unicode_str.encode("ascii", "ignore")
news_soup = BeautifulSoup(encoded_str, "html.parser")
a_text = news_soup.find_all('p')

# 2) Remove tags: strip the remaining HTML tags (including nested <a>) with a regex
y = [re.sub(r'<.+?>', r'', str(a)) for a in a_text]
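For question 2 specifically, BeautifulSoup's get_text() may be simpler than the regex step: it returns the text of a tag including everything inside nested tags such as the "a" elements, already decoded to normal Python strings. Continuing from the news_soup object above:
```
# Alternative: get_text() collects each <p>'s text, nested <a> text included.
paragraph_text = [p.get_text(" ", strip=True) for p in news_soup.find_all("p")]
print(paragraph_text[:3])
```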
  • asked a question related to Web Scraping
Question
3 answers
Noisy links that lead the user to a false target.
Relevant answer
Answer
Hi, Taghandiky,
Regarding your question, you need to be more specific. I don't know if you can open the following link, but it is a paper from 2010 with the title "Combating Link Spam by Noisy Link Analysis".
There they define a noisy link as a non-voting link, i.e. a link that does not endorse or give support to the target page it points to. One example is links from the same site (in this paper, they call them "in-link pages"). I am also linking the ResearchGate reference for this paper.
I'm assuming that you are talking about search engine ranking. Can you confirm whether you are asking about spamdexing, or give more information, so we can have a better understanding of your question?
  • asked a question related to Web Scraping
Question
2 answers
Basically I want to generate a mapping between URIs (RDF, RDFS, OWL) and natural-language keywords for a distributed meta-meta information system, to assist SPARQL query construction from natural language queries using a controlled vocabulary like WordNet. For that I have to crawl Linked Open Data to get URIs describing entities and their best-matched keywords (e.g. rdfs:label).
Relevant answer
Answer
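As a starting point for collecting URI/label pairs, a public SPARQL endpoint such as DBpedia can be queried directly instead of crawling pages; a hedged sketch using the SPARQLWrapper package (the endpoint, language filter, and LIMIT are assumptions):
```
# Hedged sketch: pull URI -> rdfs:label pairs from a public SPARQL endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")      # placeholder endpoint
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?entity ?label WHERE {
        ?entity rdfs:label ?label .
        FILTER (lang(?label) = "en")
    } LIMIT 100
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["entity"]["value"], "->", row["label"]["value"])
```
The label strings can then be matched against WordNet synsets to build the URI-to-keyword mapping.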
  • asked a question related to Web Scraping
Question
17 answers
I would like to parse a webpage and extract meaningful content from it. By meaningful, I mean the content (text only) that the user wants to see on that particular page (data excluding ads, banners, comments, etc.). I want to ensure that when a user saves a page, the data that he wanted to read is saved, and nothing else.
In short, I need to build an application which works just like Readability ( http://www.readability.com ). I need to take this useful content of the web page and store it in a separate file. I don't really know how to go about it.
I don't want to use APIs that require me to connect to the internet and fetch data from their servers, as the process of data extraction needs to be done offline.
There are two methods that I could think of:
1. Use a machine learning based algorithm
2. Develop a web scraper that could satisfactorily remove all clutter from web pages.
Is there an existing tool that does this? I came across the boilerpipe library ( http://code.google.com/p/boilerpipe/ ) but haven't used it. Has anybody used it? Does it give satisfactory results? Are there any other tools, particularly ones written in Java, which do this kind of web scraping?
If I need to build my own tool to do this, how would you suggest going about it?
Since I'd need to clean up messy or incomplete HTML before I begin its parsing, I'd use a tool like Tidy ( http://www.w3.org/People/Raggett/tidy/ ) or Beautiful Soup ( http://www.crummy.com/software/BeautifulSoup/bs4/doc/ ) to do the job.
But I don't know how to extract content after this step.
 Thanks a lot!
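One Python-side starting point, mentioned only because it is a direct port of the Readability approach named above (not Java), is the readability-lxml package; a hedged sketch that works on locally saved HTML, so no external API is needed (the file name is a placeholder):
```
# Hedged sketch: extract the main article content from saved HTML, offline.
from readability import Document      # pip install readability-lxml

with open("saved_page.html", encoding="utf-8") as f:   # placeholder file
    html = f.read()

doc = Document(html)
print(doc.title())          # page title
print(doc.summary())        # cleaned-up HTML of the main content
```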
Relevant answer
Answer
You may want to look at the following:
  • asked a question related to Web Scraping
Question
8 answers
I am interested in doing some work in the area of semantic web crawling/scraping and using that semantic data to do some discovery.
Relevant answer
Answer
Hi,
Another type of ontology is a knowledge graph such as Freebase (https://www.freebase.com/), which allows users to download the weekly data dumps or use an API to access the information.
best regards,
  • asked a question related to Web Scraping
Question
10 answers
Hi,
I have some text data that contains details such as customer name, address, etc., along with some other information that is not required, and I would like to extract the required data and place it in the appropriate fields on the web page.
Please let me know which classification approach works well here, the different types of rules, and any file conversion involved.
Relevant answer
Answer
Sree,
This would depend on the nature of the pattern. Taking samples such as those below:
Dataset 1: Sree Sindhusruthi 12456 A12872 60 Abe Street Chicago IL 901-123-2456
Dataset 2: 5439 Sree Sindhustruthi A23098 924-167-1234 324 XYZ 90 Bce Street Chicago IL
If all the variations in patterns are known for all datasets, then it would be straightforward to apply regex.
If the variation in patterns is totally random (although I would then question the source of the data and the dataset generation process, as entirely random sequences for data like "customer data" are very unusual), then a linear approach is possible: you could do a space-delimited sequential split and examine the result as below.
If d_i is a datapoint, and F(d) represents a known pattern (e.g., an "address"), then
F(d) = F(d_i, d_i+1, d_i+2, ..., d_i+c) -> you examine whether a function of successive datapoints results in any of the known patterns identified by F(d); if yes, you mark them and proceed with d_i+c+1.
Given that the datasets may not number more than 2k, an SVM approach may not really be necessary, and one of the above approaches should provide you results within a reasonable response time.
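A hedged illustration of the known-pattern approach on the sample rows above: the phone and street patterns below are assumptions tuned to those two examples and would need to be extended for every known variation:
```
# Hedged sketch: pull known patterns (phone, street address) out of sample rows.
import re

rows = [
    "Sree Sindhusruthi 12456 A12872 60 Abe Street Chicago IL 901-123-2456",
    "5439 Sree Sindhustruthi A23098 924-167-1234 324 XYZ 90 Bce Street Chicago IL",
]

PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
STREET_RE = re.compile(r"\b\d+\s+\w+\s+Street\b")

for row in rows:
    phone = PHONE_RE.search(row)
    street = STREET_RE.search(row)
    print({"phone": phone.group() if phone else None,
           "street": street.group() if street else None})
```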