scrapy-redis | Redis-based components for Scrapy | Crawler library
kandi X-RAY | scrapy-redis Summary
Redis-based components for Scrapy.
Top functions reviewed by kandi - BETA
- Create a new dupefilter from the given settings
- Return a Redis instance from the given settings
- Return an instance of the Redis client
- Get stats for a given spider
- Generate the stats key for a spider
- Convert bytes to str
- Create an object from a crawler
- Set up the Redis connection
- Serialize an item
- Return the key for a spider
- Push a request onto the queue
- Encode a request
- Remove an item from the queue
- Decode a request
- Create a Redis client from a spider
- Close the stream
- Remove an item from the server
- Remove an item from the queue
- Read an .rst file
- Close a spider
- Return an instance of the Redis client
- Process items from the Redis queue
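Several of the functions above ("Encode a request", "Decode a request") cover request serialization for the Redis queue. scrapy-redis pickles a dict of request attributes; the sketch below shows that round trip with illustrative field names (the exact dict keys used by the library are an assumption here):

```python
import pickle

def encode_request(request_dict):
    # scrapy-redis serializes a dict of request attributes with pickle
    # (via its picklecompat helpers); field names below are illustrative.
    return pickle.dumps(request_dict, protocol=-1)

def decode_request(data):
    # Restore the request dict pushed by encode_request.
    return pickle.loads(data)

req = {"url": "https://example.com", "method": "GET", "priority": 0}
assert decode_request(encode_request(req)) == req
```

Pickle keeps arbitrary Python values intact across the Redis round trip, which is why it is used over JSON here.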
scrapy-redis Key Features
scrapy-redis Examples and Code Snippets
$ git clone https://github.com/KDF5000/RSpider.git
# Replace Scrapy's default scheduler with the scrapy-redis scheduler,
# which reads the request queue from the Redis cache
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Persist scheduler state: do not clear the Redis cache, so the crawl
# can be paused and resumed
SCHEDULER_PERSIST = True
# Use a priority queue for request scheduling (the default)
#SCHEDULER_QUEUE_CLASS = 's
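The commented-out queue setting refers to scrapy-redis's priority queue, which stores encoded requests in a Redis sorted set scored so that higher-priority requests pop first. Those semantics can be sketched in plain Python, with a heap standing in for the sorted set (no Redis server needed; class and method names here are illustrative, not the library's):

```python
import heapq

class PrioritySketch:
    # Minimal stand-in for scrapy-redis's priority queue: the real queue
    # keeps encoded requests in a Redis sorted set scored by -priority;
    # a heap mimics "highest priority pops first" locally.
    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker that preserves insertion order

    def push(self, request, priority=0):
        heapq.heappush(self._heap, (-priority, self._count, request))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

q = PrioritySketch()
q.push("low-priority request", priority=0)
q.push("high-priority request", priority=10)
assert q.pop() == "high-priority request"
```

Negating the priority is the same trick the Redis-backed queue uses, since both heaps and sorted sets pop the smallest score first.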
CREATE TABLE `house` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(50) DEFAULT NULL,
  `price` varchar(50) DEFAULT NULL,
  `open_date` varchar(50) DEFAULT NULL,
  `address` varchar(255) DEFAULT NULL,
  `lon_lat` varchar(50) DEFAULT NULL,
  PRIMARY KEY (`id`)
);
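A storage pipeline would insert each scraped item into this table. The sketch below uses an in-memory SQLite database as a stand-in for MySQL (types adapted, since the `AUTO_INCREMENT` syntax differs), with made-up example data:

```python
import sqlite3

# SQLite stand-in for the MySQL `house` table above; column names match,
# types are adapted to SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE house (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT, price TEXT, open_date TEXT,
        address TEXT, lon_lat TEXT
    )
""")

# Hypothetical scraped item; named-parameter binding avoids SQL injection.
item = {"name": "Riverside Court", "price": "12000",
        "open_date": "2020-08-01", "address": "Example Rd. 1",
        "lon_lat": "120.1,30.2"}
conn.execute(
    "INSERT INTO house (name, price, open_date, address, lon_lat) "
    "VALUES (:name, :price, :open_date, :address, :lon_lat)", item)

row = conn.execute("SELECT name, price FROM house").fetchone()
assert row == ("Riverside Court", "12000")
```

The same parameterized `INSERT` works against MySQL via a driver such as `pymysql`, with `%(name)s`-style placeholders instead.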
python3 -m pip install scrapy-redis-expiredupefilter
# Use the scheduler that supports a TTL dupefilter
SCHEDULER = 'scrapy_redis_expiredupefilter.scheduler.Scheduler'
# Dupefilter with TTL
DUPEFILTER_CLASS = 'scrapy_redis_expiredupefilter.dupefilter.RFPDupeFilter'
# Redis connection
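The point of a TTL dupefilter is that a request fingerprint blocks duplicates only for a limited time, after which the URL can be recrawled. The toy model below captures that behavior with a dict mapping fingerprint to expiry timestamp; the real package stores fingerprints in Redis with a TTL, so class and method names here are illustrative:

```python
import time

class TTLDupeFilterSketch:
    # Toy model of a dupefilter whose fingerprints expire: a dict maps
    # fingerprint -> expiry timestamp instead of Redis keys with a TTL.
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.seen = {}

    def request_seen(self, fingerprint, now=None):
        now = time.time() if now is None else now
        expiry = self.seen.get(fingerprint)
        if expiry is not None and expiry > now:
            return True          # duplicate: fingerprint still fresh
        self.seen[fingerprint] = now + self.ttl
        return False             # new (or expired) fingerprint

f = TTLDupeFilterSketch(ttl_seconds=60)
assert f.request_seen("abc", now=0) is False    # first sighting
assert f.request_seen("abc", now=30) is True    # within TTL: filtered
assert f.request_seen("abc", now=100) is False  # expired: crawled again
```

This is useful for monitoring-style crawls where the same pages must be revisited periodically but not hammered within a window.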
Community Discussions
Trending Discussions on scrapy-redis
QUESTION
Sorry to disturb you guys. This is a bad question; what really confused me is how the item pipeline works in Scrapy. I'll close it and start a new question.
Where should I bind the db/redis connection in Scrapy: on the Spider or on the Pipeline?
In the Scrapy documentation, the MongoDB connection is bound on the Pipeline. But it could also be bound to the Spider (which is what the scrapy-redis extension does). The latter approach has the benefit that the spider is accessible in more places than the pipeline, such as middlewares.
So, which is the better way to do it?
I'm confused that pipelines are run in parallel (this is what the docs say). Does that mean there are multiple instances of MyCustomPipeline?
Besides, is a connection pool for redis/db preferred?
I just lack the field experience to make the decision. Need your help. Thanks in advance.
...As the docs say, the item pipeline runs in parallel. How? Are there duplicate instances of the pipeline running in threads? (I noticed FilesPipeline uses a deferred thread to save files to S3.) Or is there only one instance of each pipeline, running in the main event loop? In the latter case a connection pool doesn't seem to help, because a redis connection blocks while in use: only one connection can be used at a time.
ANSWER
Answered 2020-Jul-11 at 13:07
The best practice is to bind the connection in the pipeline, following the separation-of-concerns principle.
Scrapy uses the same parallelism infrastructure for executing requests and processing items; as your spider yields items, Scrapy calls the process_item method of the pipeline instance.
A single instance of every pipeline is instantiated during the spider instantiation.
Besides, is a connection pool for redis/db preferred?
Sorry, don't think I can help with this one.
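The answer's point (one pipeline instance per crawler, created via `from_crawler`, holding its own connection) can be sketched as follows. A dict-backed fake replaces `redis.Redis` so the example runs without a server; the class and setting names are illustrative:

```python
class FakeRedis:
    # Stand-in for redis.Redis so the sketch runs without a server.
    def __init__(self, url):
        self.url, self.store = url, []

    def rpush(self, key, value):
        self.store.append((key, value))

class RedisConnectionPipeline:
    # The connection is bound in the pipeline, not the spider,
    # keeping storage concerns out of the crawling code.
    def __init__(self, client):
        self.client = client

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this once per crawler, so exactly one pipeline
        # instance (and one connection) exists for the whole crawl.
        url = crawler.settings.get("REDIS_URL", "redis://localhost:6379")
        return cls(client=FakeRedis(url))

    def process_item(self, item, spider):
        self.client.rpush("items", repr(item))
        return item

class FakeCrawler:
    settings = {"REDIS_URL": "redis://example:6379"}

pipeline = RedisConnectionPipeline.from_crawler(FakeCrawler())
pipeline.process_item({"name": "x"}, spider=None)
assert pipeline.client.url == "redis://example:6379"
assert pipeline.client.store == [("items", "{'name': 'x'}")]
```

Because `process_item` is invoked on this single instance for every item, the one client it holds is shared across the crawl, which is what makes binding it here safe and simple.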
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install scrapy-redis
You can use scrapy-redis like any standard Python library. You will need a development environment consisting of a Python distribution with header files, a compiler, pip, and git installed. Make sure your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changes to the system.