scrapy-splash | Scrapy+Splash for JavaScript integration | Scraper library
kandi X-RAY | scrapy-splash Summary
Scrapy+Splash for JavaScript integration
Top functions reviewed by kandi - BETA
- Process a request
- Convert headers to Unicode
- Get the slot key for a request or response
- Return whether HTTP auth is enabled
- Set the download slot
- Process a Splash request
- Get cookies from a request
- Send a debug message to the logger
- Convert a cookie to a dictionary
- Process a response
- Log cookies
- Add HAR cookies to a cookie jar
- Convert a HAR cookie object into a cookie object
- Load a response from JSON
- Return the options for the Splash request
- Convert headers to Scrapy format
- Override the robots.txt middleware
scrapy-splash Key Features
scrapy-splash Examples and Code Snippets
# Find the IP address of the running Splash container
docker inspect --format '{{ .NetworkSettings.IPAddress }}' $(docker ps -q)

# Then point Scrapy at it, e.g. in the spider's custom_settings
'SPLASH_URL': 'http://0.0.0.0:8050'
# global_state.py
GLOBAL_STATE = {"counter": 0}

# middleware.py
from global_state import GLOBAL_STATE

class SeleniumMiddleware:
    def process_request(self, request, spider):
        GLOBAL_STATE["counter"] += 1
        self.driver.get(request.url)  # load the page in the Selenium driver
>>> response.css('meta[itemprop=image]::attr(content)').get()
'https://arbstorage.mncdn.com/ilanfotograflari/2021/06/23/17753653/3c57b95d-9e76-42fd-b418-f81d85389529_image_for_silan_17753653_1920x1080.jpg'
_mkt_imageDir = /BASE_IMAGES_URL=(.*?);/.test(document.cookie) && RegExp.$1 || 'https://static.zara.net/photos/';
"originalUrl":"/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115",
import json

# The rendered JSON sits in a <pre> tag; extract and parse it
json_response = response.xpath('html/body/pre/text()').get()
json_response = json.loads(json_response)
# Incorrect: FEEDS maps output URIs to option dicts, not options directly
FEEDS = {"overwrite": True}

# Correct: nest the options under the output file
FEEDS = {
    "quotes_splash.json": {
        "format": "json",
        "overwrite": True
    }
}
File "/home/andylu/.virtualenvs/scrapy_course/li
async def parse_page(self, response):
    ...
    for link in links:
        request = response.follow(link)
        # Download the follow-up request from inside the coroutine;
        # use a separate name so the outer response is not shadowed
        link_response = await self.crawler.engine.download(request, self)
        urls.append(link_response.css('a::attr(href)').get())
req = FormRequest.from_response(
    response,
    formid='login-form',
    formdata={
        'username': 'not real',
        'password': 'login data'},
    clickdata={'type': 'submit'}
)
# Incorrect: SelectorList has no getAll() method (AttributeError)
lists = response.css('#recent_list_box > li').getAll()
# Correct: the method is spelled getall()
lists = response.css('#recent_list_box > li').getall()
(//span[@class='tv-widget-fundamentals__value apply-overflow-tooltip'])[2]
(//span[@class='tv-widget-fundamentals__value apply-overflow-tooltip'])[4]
Community Discussions
Trending Discussions on scrapy-splash
QUESTION
I am using Scrapy with Splash. Here is what I have in my spider:
...ANSWER
Answered 2021-May-23 at 10:57
I ditched the CrawlSpider and converted it to a regular Spider, and things are working fine now.
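As a sketch of that conversion (the spider name, start URL, and selectors are placeholders), a regular Spider can issue SplashRequests directly instead of relying on CrawlSpider rules:

import scrapy
from scrapy_splash import SplashRequest

class RegularSpider(scrapy.Spider):
    name = 'regular_splash'

    def start_requests(self):
        # Render the start page through Splash so its JavaScript runs
        yield SplashRequest('http://quotes.toscrape.com/js/', self.parse,
                            args={'wait': 1})

    def parse(self, response):
        # Follow links explicitly instead of through CrawlSpider rules
        for href in response.css('a::attr(href)').getall():
            yield SplashRequest(response.urljoin(href), self.parse,
                                args={'wait': 1})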
QUESTION
I am trying to scrape a product page from ZARA. Like this one :https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115
My scrapy-splash container is running. In the shell I fetch the page
...ANSWER
Answered 2021-May-16 at 19:42
I have only started looking into web scraping in the last week, so I am not sure if I can be much help, but I did find something.
The source code showed this in the script at the top:
QUESTION
I have a mini project where the first page contains a list of URLs, and I have to follow each of them and open it with SplashRequest because I need the returned page rendered along with its JavaScript.
Now, I'm very new to web scraping and scrapy-splash, but I'm currently stuck because I can't figure out how to pass a variable to the callback function when using SplashRequest:
...ANSWER
Answered 2021-May-04 at 14:39
Found the answer to this one myself: apparently SplashRequest also takes meta as an argument, just like response.follow, so the mechanism for passing variables to the callback function is exactly the same as with a normal Scrapy request.
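A minimal sketch of that mechanism (the spider, URL, and the 'category' value are placeholders):

import scrapy
from scrapy_splash import SplashRequest

class DetailSpider(scrapy.Spider):
    name = 'detail_meta'
    start_urls = ['http://quotes.toscrape.com/js/']

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            yield SplashRequest(
                response.urljoin(href),
                callback=self.parse_detail,
                args={'wait': 2},              # let the JavaScript render
                meta={'category': 'example'},  # variable for the callback
            )

    def parse_detail(self, response):
        category = response.meta['category']  # read the variable back here
        yield {'url': response.url, 'category': category}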
QUESTION
I'm trying to scrape a JavaScript website using Scrapy and Selenium. I open the site with Selenium and a Chrome driver, scrape all the links to the different listings on the current page with Scrapy, and store them in a list (this has been the best way to do it so far, as trying to follow links using SeleniumRequest and calling back to a parse-new-page function caused a lot of errors). Then I loop through the list of URLs, open each one in the Selenium driver, and scrape the info from the page. So far this scrapes 16 pages per minute, which is not ideal given the number of listings on this site. I would ideally have the Selenium drivers opening links in parallel, like the following implementations:
How can I make Selenium run in parallel with Scrapy?
https://gist.github.com/miraculixx/2f9549b79b451b522dde292c4a44177b
However, I can't figure out how to implement parallel processing in my Selenium-Scrapy code.
...ANSWER
Answered 2021-Feb-09 at 01:10
The following sample program creates a thread pool with only 2 threads for demo purposes and then scrapes 4 URLs to get their titles:
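The original program is not reproduced here, but a minimal sketch of the idea (the URLs are placeholders, and each task opens and closes its own Chrome driver) looks like this:

from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver

URLS = [
    'https://example.com/a',
    'https://example.com/b',
    'https://example.com/c',
    'https://example.com/d',
]

def fetch_title(url):
    driver = webdriver.Chrome()  # one driver per task
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()

# Two worker threads process the four URLs in parallel
with ThreadPoolExecutor(max_workers=2) as pool:
    for title in pool.map(fetch_title, URLS):
        print(title)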
QUESTION
I'm creating my first Scrapy project with Splash and working with the test data from http://quotes.toscrape.com/js/
I want to store the quotes of each page as a separate file on disk (in the code below I first try to store the entire page). I have the code below, which worked when I was not using SplashRequest, but with the new code nothing is stored on disk when I 'Run and debug' it in Visual Studio Code.
Also, self.log does not write to my Visual Studio Code terminal window. I'm new to Splash, so I'm sure I'm missing something, but what?
ANSWER
Answered 2020-Oct-19 at 09:09
The JavaScript on the website you wish to scrape isn't being executed.
Solution: increase the SplashRequest wait time to allow the JavaScript to execute.
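A minimal sketch of that fix, assuming the quotes.toscrape.com test site from the question (the wait value is something to tune per site):

import scrapy
from scrapy_splash import SplashRequest

class WaitSpider(scrapy.Spider):
    name = 'wait_demo'

    def start_requests(self):
        # wait=3 gives the page's JavaScript three seconds to execute
        yield SplashRequest('http://quotes.toscrape.com/js/', self.parse,
                            args={'wait': 3})

    def parse(self, response):
        self.log(response.text[:200])  # now contains the rendered HTML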
QUESTION
I'm trying to get data from this website using scrapy-splash, but I'm not able to extract the data. I want to get data about each real estate listing, like href, price, etc. Here is my code:
In settings.py:
...ANSWER
Answered 2020-Oct-06 at 08:45
What I would do instead is this:
Send a request to https://www.metrocuadrado.com/results/_next/static/chunks/commons.8afec6af6d5add2097bf.js; in the response you'll find an API key if you search for "X-Api-Key". It can be extracted easily with a regex, something like re.findall(r'"X-Api-Key":"(\w+)"', response.text).
Then, once you've extracted the API key, send a request to https://www.metrocuadrado.com/rest-search/search?seo=/bodega/arriendo&from=0&size=50, which is the hidden API of the website you mentioned. To get a valid response you have to attach the API key as a header.
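A sketch of the whole flow with the requests package, assuming the endpoints quoted above and the "X-Api-Key" header name from the answer:

import re
import requests

# Step 1: pull the API key out of the site's JS bundle
js = requests.get(
    'https://www.metrocuadrado.com/results/_next/static/chunks/'
    'commons.8afec6af6d5add2097bf.js'
).text
api_key = re.findall(r'"X-Api-Key":"(\w+)"', js)[0]

# Step 2: call the hidden search API with the key attached as a header
resp = requests.get(
    'https://www.metrocuadrado.com/rest-search/search'
    '?seo=/bodega/arriendo&from=0&size=50',
    headers={'X-Api-Key': api_key},
)
print(resp.json())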
QUESTION
Since I am not able to log in to https://www.duif.nl/login, I tried many different methods, like Selenium, with which I successfully logged in but didn't manage to start crawling.
Now I have tried my luck with scrapy-splash, but I can't log in :(
If I render the login page with Splash, I see the following picture:
Well, there should be a login form with a username and password, but Scrapy can't see it?
I've been sitting in front of that login form for a week now and am losing my will to live.
My last question didn't even get one answer, so now I'm trying again.
Here is the HTML code of the login form:
When I log in manually, I get redirected to "/login?returnUrl=", where I only have this form data:
My Code
...ANSWER
Answered 2020-Sep-23 at 07:35
- I don't think using Splash here is the way to go, as even with a normal Request the form is there:
response.xpath('//form[@id="login-form"]')
- There are multiple forms available on the page, so you have to specify which form to base your FormRequest.from_response on. It's best to specify the clickdata as well (so it goes to 'Login', not to 'forgot password'). In summary it would look something like the sketch below.
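A sketch of that summary (the credentials are placeholders; the formid comes from the question's HTML):

import scrapy
from scrapy import FormRequest

class DuifSpider(scrapy.Spider):
    name = 'duif_login'
    start_urls = ['https://www.duif.nl/login']

    def parse(self, response):
        # The form is present in the plain response, so no Splash is needed
        yield FormRequest.from_response(
            response,
            formid='login-form',           # pick the login form explicitly
            formdata={'username': 'user', 'password': 'pass'},  # placeholders
            clickdata={'type': 'submit'},  # go to 'Login', not 'forgot password'
            callback=self.after_login,
        )

    def after_login(self, response):
        self.log(f'Logged in, landed on {response.url}')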
QUESTION
I am trying to use the Scrapy shell to get into "ykc1.greatwestlife.com", which should be a public website. Although there is plenty of content when I look at the page source manually, I cannot get a correct response using Scrapy.
Do I need to use scrapy-splash in this case? Any ideas? Thanks.
...ANSWER
Answered 2020-Aug-08 at 21:57
You can actually see the two back-to-back requests, caused by
QUESTION
I am trying to scrape a page that has a list of elements and, at the bottom, an expand button that extends the list. It uses an onclick event to expand, and I don't know how to trigger it. I'm trying to use scrapy-splash, since I read it might work, but I can't make it function properly.
What I am currently trying to do is something like this
...ANSWER
Answered 2020-Aug-03 at 09:27
It's not necessary to use Splash. If you look at the network tab of Chrome DevTools, the page is making an HTTP GET request with some parameters. Replaying that request is called re-engineering the HTTP request and is preferable to using Splash/Selenium, particularly if you're scraping a lot of data.
When re-engineering a request, copy it as cURL (bash) and paste it into curl.trillworks.com. This gives you nicely formatted headers, parameters, and cookies for that particular request. I usually play around with the HTTP request using the requests Python package. In this case, the simplest HTTP request is one where you just have to pass the parameters and not the headers.
If you look on the right-hand side you have headers and parameters. Using the requests package, I figured out that you only need to pass the page parameters to get the information you need, as in the sketch below.
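A sketch of replaying such a request with only its query parameters (the URL and parameter names are placeholders for whatever DevTools shows):

import requests

params = {'page': 2}  # hypothetical paging parameter found in DevTools
resp = requests.get('https://example.com/api/items', params=params)
print(resp.status_code)
print(resp.json())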
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install scrapy-splash
You can use scrapy-splash like any standard Python library. You will need a development environment consisting of a Python distribution (including header files), a compiler, pip, and git. Make sure that your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changes to the system.
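After pip install scrapy-splash and starting a Splash instance (for example with docker run -p 8050:8050 scrapinghub/splash), the project's README has you wire it into your Scrapy settings roughly like this:

# settings.py: point Scrapy at the running Splash instance and enable
# the scrapy-splash middlewares (values follow the project README)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'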