Questions related to Web Scraping
I am interested in analyzing some features of editor and reviewer comments during the peer review process. Scopus, Google Scholar, and Web of Science don't appear to offer an option to limit search results to articles that include reviewer reports, and neither does PLoS ONE. Does anybody know a workaround that doesn't involve web scraping?
(I know some journals like BMC publish reviewer reports as a standard practice, but I would like to search multiple journals that offer, but do not require, open peer review.)
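One non-scraping avenue might be the Crossref REST API, which has a "peer-review" work type for journals that deposit their review reports; whether it covers the journals of interest is an assumption to verify. A minimal Python sketch:

import requests

# ask Crossref for deposited peer-review records (the "type:peer-review" filter is the assumption here)
resp = requests.get(
    "https://api.crossref.org/works",
    params={"filter": "type:peer-review", "rows": 5},
)
resp.raise_for_status()
for item in resp.json()["message"]["items"]:
    print(item.get("DOI"), item.get("title"))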
Recently I have been trying to scrape Yahoo Finance financial tables for my Ph.D. study. I am using R and the rvest package, but I realized I cannot scrape all the data in the tables because of the "expand all" and "collapse all" buttons. Then I found out it is possible to click those buttons with RSelenium.
The button's element looks like "<div class="expandPf Fz(s)....".
My code is:
webElem <- remDr$findElement(using = "class", ".expandPf")
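# note: "class" is likely not a valid locator strategy here; RSelenium expects using = "class name" with value = "expandPf" (no leading dot), or using = "css selector" with value = ".expandPf"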
but I am getting an error. I tried it with "CSS" as well, but I get the same error message.
Does anyone know how to fix it?
I am currently doing a web scraping study about the online prices of laptops. My objectives are
- to test if there is a significant difference in online prices of different laptop brands on different days of the month;
- to determine if there is a significant difference in online prices of different laptop ranges on different days of the month; and
- to test if there is a significant difference in online prices of laptop ranges per brand on different days of the month.
I finished scraping the online prices of laptops last week. I am planning to use two-way and three-way ANOVA to answer these objectives, but while checking the ANOVA assumptions I noticed that the homogeneity-of-variance assumption is violated (I used Levene's test to check it). Is it still okay to proceed with two-way and three-way ANOVA?
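For reference, Levene's test is easy to re-run in Python with scipy; this is only a minimal sketch, assuming a hypothetical CSV with "price" and "brand" columns, not a recommendation on how to handle the violated assumption:

import pandas as pd
from scipy import stats

# hypothetical file and column names; adapt to the scraped dataset
df = pd.read_csv("laptop_prices.csv")
groups = [g["price"].values for _, g in df.groupby("brand")]
# a small p-value suggests the equal-variance assumption is violated
stat, p = stats.levene(*groups, center="median")
print(f"Levene W = {stat:.3f}, p = {p:.4f}")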
I want to dive deeper into data analytics. I have done quite a few basic projects, e.g. getting a random dataset, cleaning/analyzing it, and then making visualizations. However, I want to conduct a more challenging project, e.g. some web scraping, maybe using SQL and Python to analyze and visualize data, to help solve some potential real-world scenarios. But as a typical computer science student, my brain isn't creative and I'm struggling to think of valid ideas. One idea I had is to gather Uber driving data at my university, store it in a SQL database, do some cleaning and analysis, and try to visualize the busier spots, etc.
I would much appreciate some ideas from those established in this field! Thank you.
Dear community, in order to analyze a competitor's website and its features, I want to extract some data from it. This is the first time I've heard about web scraping. Can you please guide me on how to do it using Python for competitive intelligence?
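For illustration, a minimal Python sketch using requests and BeautifulSoup; the URL and CSS selectors below are placeholders that would need to be adapted to the actual site's markup (and the site's terms of use checked first):

import requests
from bs4 import BeautifulSoup

# hypothetical target URL; replace with the page you want to analyze
url = "https://example.com/products"
resp = requests.get(url, headers={"User-Agent": "research-bot"})
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
# hypothetical selectors; inspect the page to find the real ones
for item in soup.select("div.product"):
    name = item.select_one("h2")
    price = item.select_one("span.price")
    print(name.get_text(strip=True) if name else None,
          price.get_text(strip=True) if price else None)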
Illegal wildlife trade (IWT) in cybermarkets is a growing issue around the world. Scientists are therefore adapting their strategies to a new paradigm of biodiversity conservation at internet speed: we need common analytical methods, sensitized stakeholders, and joint efforts to maximize our actions.
I search for potential illegalities in the trade of Brazilian species on the internet and mostly use Web Scraper to sample pages and their content, which I train my "digital robots" to capture.
My main goals are to find, monitor, and expose these marketplaces in order to neutralize advertisements and sensitize public opinion to the risks of digital biopiracy (mostly IWT conducted over the internet).
If you know of any other method for capturing information on the internet, please let me know.
Thank you for your help.
Using requests in Python we can get JSON files from different webpages. However, in some cases, due to Cloudflare or cookie requirements, we cannot get the JSON files. On the other hand, some browser extensions can retrieve any JSON on a webpage. Can we connect a Python script to a browser extension to get the JSON file?
If yes, how?
Thanks in advance.
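One common workaround is not a browser extension but driving a real browser with Selenium, so the request carries the browser's cookies and can pass the Cloudflare check; a rough sketch, assuming the JSON endpoint renders as plain text in the page body (the URL is a placeholder):

import json
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a Chrome driver is available
driver.get("https://example.com/api/data.json")  # hypothetical JSON URL
# the JSON response is rendered as the page's body text in the browser
raw = driver.find_element(By.TAG_NAME, "body").text
data = json.loads(raw)
print(data)
driver.quit()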
The VirusTotal website is a tool for detecting malware in Android apps, but it also gives various information about the uploaded APK, such as manifest tags, permissions, etc. I want to extract this information for multiple APKs. How can I scrape and extract this data for multiple APKs?
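Rather than scraping the website, VirusTotal's public REST API (v3) can return the file report as JSON given an API key and the APK's hash. A minimal sketch; the hashes are placeholders, and the exact attribute names for the Android details (permissions, manifest info) should be checked against the API documentation:

import requests

API_KEY = "YOUR_VT_API_KEY"  # personal key from a VirusTotal account
hashes = ["<sha256-of-apk-1>", "<sha256-of-apk-2>"]  # hypothetical APK hashes

for h in hashes:
    resp = requests.get(
        f"https://www.virustotal.com/api/v3/files/{h}",
        headers={"x-apikey": API_KEY},
    )
    resp.raise_for_status()
    attrs = resp.json()["data"]["attributes"]
    # inspect attrs to locate the Android-specific fields (permissions, manifest tags, etc.)
    print(h, list(attrs.keys()))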
I'm writing code to scrape email addresses using Scrapy to find prospects for SEO link building.
I'm having difficulty filtering out the unwanted outputs that I'm getting.
At present, I can see two ways to proceed:
1) manipulate the text by adjusting the regular-expression pattern used with the re.findall() function;
2) reject the unwanted strings from the matched email addresses with a filtering/'reject' step.
Can anyone help me with the code?
- Aashay Raut
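A rough sketch of option 1, tightening the regular expression and filtering the matches afterwards; the pattern and the rejected extensions are only examples of the kind of noise (e.g. image filenames such as logo@2x.png) that often slips through:

import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# example reject rule: "emails" that are really image filenames or other unwanted strings
REJECT_RE = re.compile(r"\.(png|jpe?g|gif|svg|webp)$", re.IGNORECASE)

def extract_emails(text):
    return [m for m in EMAIL_RE.findall(text) if not REJECT_RE.search(m)]

print(extract_emails("contact us at info@example.com or see logo@2x.png"))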
For the most part, I have found papers by following researchers on Twitter and through blog posts, but I was wondering how other people keep informed. What kinds of methods have helped you? Apps? Reading lists? A daily schedule? Web scraping?
I will need to scrape Instagram for public posts related to a specific hashtag as data for a content and visual analysis that is part of my project.
Now, as I understand it, scraping data for academic purposes is legal (and ethical if done right) here in Norway and in the US (where Instagram is based). However, Instagram's TOS states that "You can't attempt to create accounts or access or collect information in unauthorized ways. This includes creating accounts or collecting information in an automated way without our express permission."
My question is: Is it still possible for me as a researcher to scrape Instagram, or does this TOS-point weigh too heavily against it?
I am new to Python and am making a web scraping program to automatically collect information about scholarships for me, and I want to share it with my friends.
It has a few dependencies.
It runs perfectly in PyCharm and from the command line.
But when I try to convert it to an .exe using PyInstaller, it just blinks a black screen. I put it in the same folder that has the dependencies, but it doesn't work either.
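A common cause of the "blinking black screen" is simply that the console window closes before the traceback can be read. One way to see the error is to run the .exe from an already-open command prompt; another is to wrap the entry point as sketched below (main() and the file names are hypothetical):

import traceback

def main():
    # placeholder for the actual scraping logic
    print("collecting scholarship information...")

if __name__ == "__main__":
    try:
        main()
    except Exception:
        traceback.print_exc()
    input("Press Enter to exit...")  # keeps the console window open

Building with, for example, pyinstaller --onefile scraper.py and keeping any data files the script reads next to the .exe (or bundling them with --add-data) is also worth trying.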
I need to scrape as much as I can for my PhD research. My research area is health communication; it investigates the role of mediated communication in public health, specifically focusing on the anti-vaccine issue as a comparative study of vaccination messages in KSA and AUS. I will focus on one media platform: Twitter or Facebook.
I have used a scraper tool to collect data from Twitter. I started with some hashtags for KSA and added a few more hashtags I found. I noticed that some hashtags are used for spam; I tried to clean the data of spam as much as I can, but I may still find some spam tweets.
At the same time, I have found bad news: Facebook and Instagram are banning anti-vaccination content, and it seems Twitter is starting to do the same. A lot of the hashtags I'm trying in English give very few and poor results, even when I'm not focusing on KSA or Australia, as you can see in this link:
As a result, I am facing two problems: how can I determine the country when scraping data, and how can I translate the data from Arabic to English for analysis? I will use Leximancer, and it does not work with Arabic content.
I need to be collecting as much data NOW as I can, so do you have any helpful advice on this, please?
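Two partial, hedged pointers in Python: Twitter's standard search accepts a geocode parameter ("latitude,longitude,radius"), which can approximate country-level filtering, and a library such as googletrans (whose reliability varies) can do rough Arabic-to-English translation before the text goes to Leximancer. The credentials, coordinates, and query below are placeholders:

import tweepy
from googletrans import Translator  # pip install googletrans

# hypothetical credentials
auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

# geocode = "lat,long,radius"; the example values roughly target the Riyadh area
for tweet in api.search_tweets(q="#vaccine", geocode="24.7,46.7,1000km", count=10):
    en = Translator().translate(tweet.text, src="ar", dest="en").text
    print(tweet.user.location, "|", en)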
Many websites say in their terms that use of anything (text, pictures, etc) on their site is prohibited because it is their intellectual property. Does anyone know if it is actually illegal or legal to web scrape data from websites to use in research? Do I need to get permission from each individual website I want to scrape? Does the data need to be "anonymous" when published (i.e. someone can't determine which website it came from)?
I want to extract data from the internet containing specific medical terms. Which tools are available with the best accuracy or output?
I need some help using the BeautifulSoup library for web scraping.
I need to extract the text from the webpage http://thehill.com/blogs/blog-briefing-room/365407-sean-diddy-combs-wants-to-buy-carolina-panthers-and-sign-kaepernick
My goal is to get the extracted text exactly as in the webpage, for which I am extracting all the "p" tags and their text, but inside the "p" tags there are "a" tags which also have some text.
So my questions are:
1. How do I convert the Unicode text ("") into normal strings, as the text appears in the webpage? When I extract only the "p" tags, BeautifulSoup returns the text as Unicode and even the special characters are Unicode, so I want to convert the extracted Unicode text into normal text. How can I do that?
2. How do I extract the text inside "p" tags which have "a" tags in them? I mean I would like to extract the complete text inside the "p" tags, including the text inside nested tags.
I have tried with the following code:
import requests
from bs4 import BeautifulSoup

html = requests.get("http://thehill.com/blogs/blog-briefing-room/365407-sean-diddy-combs-wants-to-buy-carolina-panthers-and-sign-kaepernick").content
news_soup = BeautifulSoup(html, "html.parser")
paragraphs = news_soup.find_all('p')  # find_all() returns a list of <p> tags, so iterate over it
# get_text() gives each paragraph's full text, including text inside nested <a> tags, as a plain str
text = "\n".join(p.get_text() for p in paragraphs)
Basically, I want to generate a mapping between URIs (RDF, RDFS, OWL) and natural-language keywords for a distributed meta-meta information system, to assist SPARQL query construction from natural-language queries using a controlled vocabulary such as WordNet. For that I have to crawl linked open data to get URIs describing entities and their best-matched keywords (e.g. rdfs:label).
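As one concrete way to harvest URI-to-keyword pairs, a minimal SPARQLWrapper sketch against the public DBpedia endpoint, pulling each resource's rdfs:label (the endpoint, language filter, and limit are just example choices):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?uri ?label WHERE {
        ?uri rdfs:label ?label .
        FILTER (lang(?label) = "en")
    } LIMIT 100
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["uri"]["value"], "->", row["label"]["value"])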
I would like to parse a webpage and extract meaningful content from it. By meaningful, I mean the content (text only) that the user wants to see in that particular page (data excluding ads, banners, comments etc.) I want to ensure that when a user saves a page, the data that he wanted to read is saved, and nothing else.
In short, I need to build an application which works just like Readability ( http://www.readability.com ). I need to take the useful content of the web page and store it in a separate file. I don't really know how to go about it.
I don't want to use APIs that require me to connect to the internet and fetch data from their servers, as the data extraction needs to be done offline.
There are two methods that I could think of:
1. Use a machine learning based algorithm
2. Develop a web scraper that could satisfactorily remove all clutter from web pages.
Is there an existing tool that does this? I came across the boilerpipe library ( http://code.google.com/p/boilerpipe/ ) but didn't use it. Has anybody used it? Does it give satisfactory results? Are there any other tools, particularly written in java which do this kind of web scraping?
If I need to build my own tool to do this, how would you suggest going about it?
Since I'd need to clean up messy or incomplete HTML before I begin its parsing, I'd use a tool like Tidy ( http://www.w3.org/People/Raggett/tidy/ ) or Beautiful Soup ( http://www.crummy.com/software/BeautifulSoup/bs4/doc/ ) to do the job.
But I don't know how to extract content after this step.
Thanks a lot!
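Not Java, but for comparison, the readability-lxml port in Python does this kind of boilerplate removal in a few lines and runs offline on HTML you already have (the URL below is a placeholder; the page could equally be read from a saved file):

import requests
from bs4 import BeautifulSoup
from readability import Document  # pip install readability-lxml

html = requests.get("https://example.com/article").text  # or open a saved .html file
doc = Document(html)
print(doc.title())
main_html = doc.summary()  # cleaned HTML containing only the main content
text = BeautifulSoup(main_html, "html.parser").get_text(separator="\n")
print(text[:500])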
I am interested in doing some work in the area of semantic web crawling/scraping and using that semantic data to do some discovery.
I have some text data containing details such as customer name, address, etc., along with some other information that is not required. I would like to extract the required data and place it in the appropriate fields on a web page.
Please let me know which classification approach works well here, the different types of rules involved, and any file conversion needed for this.
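If the goal is pulling names, addresses, and similar fields out of free text, named-entity recognition is one common approach; a small spaCy sketch (the model name and example sentence are placeholders):

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
text = "John Smith, 42 Baker Street, London, ordered two laptops."
doc = nlp(text)
fields = {}
for ent in doc.ents:
    fields.setdefault(ent.label_, []).append(ent.text)
print(fields)  # e.g. {'PERSON': ['John Smith'], 'GPE': ['London'], ...}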