alltheplaces | A set of spiders and scrapers to extract location information | Scraper library
kandi X-RAY | alltheplaces Summary
A set of spiders and scrapers to extract location information from places that post their location on the internet.
Top functions reviewed by kandi - BETA
- Parse and return a list of shop objects
- Join a list of addresses together
- Join the fields of src with the given fields
- Extract details from a website
- Parse GitHub API response
- Get the opening hours
- Add a range to the time range
- Sanitise a day
- Parse major API response
- Parse the shop information from the shop response
- Parse the BeautifulSoup response
- Parse the shop response
- Parse the opening hours
- Parse the offices
- Parse GMap response
- Parse the response from the API
- Parse the website address
- Parse the response from the API
- Parse the response
- Parse request response
- Parse a beacon response
- Parse the response from shell
- Parse the response
- Parse the store
- Parse the Firestore response
- Parse a waitrose response
alltheplaces Key Features
alltheplaces Examples and Code Snippets
Community Discussions
Trending Discussions on alltheplaces
QUESTION
I'm trying to build a system to run a few dozen Scrapy spiders, save the results to S3, and let me know when it finishes. There are several similar questions on StackOverflow (e.g. this one and this other one), but they all seem to follow the same recommendation (from the Scrapy docs): set up a CrawlerProcess, add the spiders to it, and call start().
When I tried this method with all 325 of my spiders, though, it eventually locked up and failed because it attempted to open too many file descriptors on the system running it. I've tried a few things that haven't worked.
What is the recommended way to run a large number of spiders with Scrapy?
Edited to add: I understand I can scale up to multiple machines and pay for services to help coordinate (e.g. ScrapingHub), but I'd prefer to run this on one machine using some sort of process pool + queue so that only a small fixed number of spiders are ever running at the same time.
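For reference, the CrawlerProcess pattern the question refers to looks roughly like the minimal sketch below (based on the public Scrapy API, not on code from this repository); it schedules every spider of the project in a single process, which is exactly the approach that runs into the file-descriptor limit at this scale.

```python
# Minimal sketch of the CrawlerProcess approach described in the question,
# run from inside a Scrapy project directory.
from scrapy.crawler import CrawlerProcess
from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
spider_loader = SpiderLoader.from_settings(settings)

process = CrawlerProcess(settings)
for spider_name in spider_loader.list():   # every spider registered in the project
    process.crawl(spider_name)             # schedule it on the shared Twisted reactor

process.start()                            # blocks until every crawl has finished
```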
...ANSWER
Answered 2018-Jan-04 at 04:18
"it eventually locks up and fails because it attempts to open too many file descriptors on the system that runs it"
That's probably a sign that you need multiple machines to execute your spiders: a scalability issue. You can also scale vertically to make your single machine more powerful, but you would hit that "limit" much sooner:
Check out the Distributed Crawling documentation and the scrapyd project.
There is also a cloud-based distributed crawling service called ScrapingHub, which would take the scalability problem off your hands altogether (note that I am not advertising them; I have no affiliation with the company).
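As a concrete illustration of the scrapyd route: once the project has been deployed to a scrapyd instance (for example with scrapyd-client), each spider can be queued through scrapyd's HTTP JSON API, and scrapyd runs at most max_proc jobs at a time, which also gives the bounded concurrency the asker wants. The sketch below assumes a local scrapyd on its default port and uses "alltheplaces" as the deployed project name; both are assumptions, not part of the original answer.

```python
# Sketch: queue every spider of a deployed project through scrapyd's HTTP API.
# Assumes scrapyd is running locally on its default port (6800) and that the
# project has already been deployed; the project name is an assumption.
import requests

SCRAPYD = "http://localhost:6800"
PROJECT = "alltheplaces"

# listspiders.json returns the spiders known to the deployed project
spiders = requests.get(
    f"{SCRAPYD}/listspiders.json", params={"project": PROJECT}
).json()["spiders"]

for name in spiders:
    # schedule.json enqueues a job; scrapyd runs at most max_proc jobs in parallel
    requests.post(
        f"{SCRAPYD}/schedule.json", data={"project": PROJECT, "spider": name}
    )
```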
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install alltheplaces
This project uses pipenv to handle dependencies and virtual environments. To get started, make sure you have pipenv installed.
With pipenv installed, check out the alltheplaces repository: git clone git@github.com:alltheplaces/alltheplaces.git
Then install the dependencies for the project: cd alltheplaces && pipenv install
After the dependencies are installed, make sure you can run the scrapy command without error: pipenv run scrapy
If pipenv run scrapy ran without complaining, then you have a functional scrapy setup and are ready to write a scraper.
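As a quick end-to-end check, a new scraper follows the standard Scrapy spider pattern. The sketch below is a generic illustration, not one of the project's real spiders: the class name, URL, and CSS selectors are made up, and actual alltheplaces spiders typically live under locations/spiders/ and yield the project's item type rather than plain dicts.

```python
# Hypothetical example spider: the class name, URL, and selectors are invented
# to show the shape of a scraper; adapt them to a real store-locator page.
import scrapy


class ExampleStoreSpider(scrapy.Spider):
    name = "example_store"
    start_urls = ["https://example.com/store-locator"]

    def parse(self, response):
        # one result per store entry on the page
        for store in response.css("div.store"):
            yield {
                "ref": store.attrib.get("data-id"),
                "name": store.css("h3::text").get(),
                "addr_full": store.css(".address::text").get(),
            }
```

Such a spider would then be run with pipenv run scrapy crawl example_store.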