scrapy | Scrapy: a web crawling framework library | Crawler library
kandi X-RAY | scrapy Summary
Top functions reviewed by kandi - BETA
- Process a GMS image
- Parse the GdItem response
- Return the image path
- Process a request
- Process an exception
- Process start requests
- Initialize the MySQL connection
- Process a result from the spider
- Get media requests
Community Discussions
Trending Discussions on scrapy
QUESTION
I want to submit the form with the five values below. By submitting the form, I can get the redirection URL. I don't know where the issue is. Can anyone help me submit the form with the required info to get the next-page URL?
Code for your reference:
...ANSWER
Answered 2021-Jun-16 at 01:24
Okay, this should do it.
QUESTION
I will explain the goal in more detail. The point of the script is to check (product code) values in column A on a supplier website; if the product is available, the loop checks the next value.
If the product is not on the site, a JSON PUT request is sent to a different sales website that sets the inventory level to 0.
The issue is how to assign the value in column B of the same CSV file to the PUT request.
CSV file
...ANSWER
Answered 2021-Jun-14 at 13:45
From Scrapy's documentation, Passing additional data to callback functions: you basically want to pass the code to the callback in Request's cb_kwargs argument.
To get all codes, you could iterate over (COL-A, COL-B) pairs, not simply over COL-A values. Here we return the 2D numpy array, i.e. the list of rows, where each row is a (COL-A, COL-B) pair:
QUESTION
I have a web page
https://myeplanning.oxfordshire.gov.uk/Disclaimer?returnUrl=%2FSearch%2FAdvanced
that contains an Accept button. If I press the button, I am redirected to another page:
https://myeplanning.oxfordshire.gov.uk/Search/Advanced
I want to get the redirected URL without using Selenium; can that be done using Scrapy?
Can anyone give raw code to do this?
...ANSWER
Answered 2021-Jun-14 at 09:28
This is a button, and it has to be inside a form, which should have an action="URL" attribute. You have this URL in the form's action.
QUESTION
I am new to Scrapy. I tried scraping this football data: related website
I wanted to get the position of each player; there are 25 players in the table, but I am getting 25 empty lists.
Below is my CSS selector:
...ANSWER
Answered 2021-Jun-13 at 05:34
You use too many elements in the CSS selector - you should use something simpler, because some elements may exist in the browser's DOM tree (which it shows in DevTools) but not in the real HTML (which you get from the server). E.g. tbody usually doesn't exist in the HTML.
This gives me results
QUESTION
I'd like to give Scrapy contracts a shot as an alternative to full-fledged test suites.
The following is a detailed description of the steps to duplicate.
In a tmp directory
ANSWER
Answered 2021-Jun-12 at 00:19
With @url http://www.amazon.com/s?field-keywords=selfish+gene I also get error 503.
It is probably a very old example - it uses http, but modern pages use https - and Amazon may have rebuilt the page; it now has a better system to detect spammers/hackers/bots and block them.
If I use @url http://toscrape.com/ then I don't get error 503, but I still get another error, FAILED, because it needs some code in parse().
@scrapes Title Author Year Price means it has to return an item with the keys Title, Author, Year, Price.
QUESTION
I am trying to loop over an XPath in Scrapy which looks like this:
...ANSWER
Answered 2021-Jun-10 at 14:05
This XPath:
'normalize-space(//div[@id="Content"]//div[@id="programDetails"]//div[@id="selfReportedProgramDetails"]//div[@id="hoursOfOperation"]//span[@class="hoursItem"]//span[@class="times"]/text())'
will give only one result, because the normalize-space() function collapses everything into a single whitespace-normalized string.
So to get the actual text nodes for those spans, remove the normalize-space() around your XPath.
The second XPath starts with a double slash, meaning it will search all nodes from the root. To search from the current context, use .//
For more info on // vs .// see this good answer.
QUESTION
I am currently building a small test project to learn how to use crontab
on Linux (Ubuntu 20.04.2 LTS).
My crontab file looks like this:
* * * * * sh /home/path_to .../crontab_start_spider.sh >> /home/path_to .../log_python_test.log 2>&1
What I want crontab to do, is to use the shell file below to start a scrapy project. The output is stored in the file log_python_test.log.
My shell file (numbers are only for reference in this question):
...ANSWER
Answered 2021-Jun-07 at 15:35
I found a solution to my problem. In fact, just as I suspected, there was a directory missing from my PYTHONPATH. It was the directory that contained the gtts package.
Solution: if you have the same problem,
- Find the package. I looked at that post.
- Add it to sys.path (which will also add it to PYTHONPATH).
Add this code at the top of your script (in my case, pipelines.py):
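The answer's snippet is not reproduced, so here is the usual form of such a fix; the directory is a placeholder to be replaced with wherever the missing package (gtts, in the answer's case) actually lives:

```python
import sys

# Placeholder path: substitute the directory that contains the package
# your cron-launched interpreter fails to find.
package_dir = "/usr/lib/python3/dist-packages"
if package_dir not in sys.path:
    # Appending here affects the running interpreter's import path.
    sys.path.append(package_dir)
```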
QUESTION
How could I use a globally defined variable (a pandas data frame) df within a Scrapy spider?
ANSWER
Answered 2021-Jun-07 at 07:37
You need to declare the variable inside the class; if you want to initialize it, do that in the constructor.
QUESTION
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.pipelines.files import FilesPipeline
from urllib.parse import urlparse
import os

class DatasetItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

class MyFilesPipeline(FilesPipeline):
    pass

class DatasetSpider(scrapy.Spider):
    name = 'Dataset_Scraper'
    url = 'https://kern.humdrum.org/cgi-bin/browse?l=essen/europa/deutschl/allerkbd'
    headers = {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    custom_settings = {
        'FILES_STORE': 'Dataset',
        'ITEM_PIPELINES': {"/home/LaxmanMaharjan/dataset/MyFilesPipeline": 1}
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.url,
            headers=self.headers,
            callback=self.parse
        )

    def parse(self, response):
        item = DatasetItem()
        links = response.xpath('.//body/center[3]/center/table/tr[1]/td/table/tr/td/a[4]/@href').getall()
        for link in links:
            item['file_urls'] = [link]
            yield item
            break

if __name__ == "__main__":
    # run spider from script
    process = CrawlerProcess()
    process.crawl(DatasetSpider)
    process.start()
...ANSWER
Answered 2021-Jun-05 at 18:16
In case the pipeline code, spider code, and process launcher are stored in the same file, you can use __main__ in the path to enable the pipeline:
QUESTION
EDIT
As Georgiy suggested, I tried to yield a dict instead of an Item, and the results are the same.
EDIT END
Trying to export Scrapy output to a JSON file. An item should have this format:
...ANSWER
Answered 2021-Jun-04 at 14:40
I am not sure about the usage of Item classes for these nested items.
The fastest way to achieve this is to yield dictionaries (not Item class objects):
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Install scrapy
You can use scrapy like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.
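The steps above can be sketched for a POSIX shell as follows (the environment name scrapy-env is arbitrary):

```shell
# Create and activate an isolated virtual environment
python3 -m venv scrapy-env
. scrapy-env/bin/activate

# Keep the packaging toolchain current, then install Scrapy
pip install --upgrade pip setuptools wheel
pip install scrapy

# Verify the install
scrapy version
```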