scrapy | fast high-level web crawling | Crawler library

by scrapy | Python | Version: 2.11.1 | License: BSD-3-Clause

kandi X-RAY | scrapy Summary

scrapy is a Python library typically used in Automation, Crawler, and Selenium applications. It has a build file available, a Permissive License, and high support. However, scrapy has 47 bugs and 6 vulnerabilities. You can install it using 'pip install scrapy' or download it from GitHub or PyPI.

Scrapy, a fast high-level web crawling & scraping framework for Python.

Support

              scrapy has a highly active ecosystem.
              It has 47503 star(s) with 10019 fork(s). There are 1783 watchers for this library.
              There were 1 major release(s) in the last 6 months.
              There are 482 open issues and 2330 have been closed. On average issues are closed in 299 days. There are 255 open pull requests and 0 closed requests.
              It has a positive sentiment in the developer community.
The latest version of scrapy is 2.11.1.

Quality

              scrapy has 47 bugs (1 blocker, 3 critical, 31 major, 12 minor) and 588 code smells.

Security

No vulnerabilities have been reported against scrapy or its dependent libraries.
However, scrapy code analysis shows 6 unresolved vulnerabilities (1 blocker, 0 critical, 5 major, 0 minor).
              There are 1523 security hotspots that need review.

License

              scrapy is licensed under the BSD-3-Clause License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

              scrapy releases are available to install and integrate.
A deployable package is available on PyPI.
A build file is available, so you can build the component from source.
              scrapy saves you 17074 person hours of effort in developing the same functionality from scratch.
              It has 33894 lines of code, 3646 functions and 333 files.
              It has low code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

kandi has reviewed scrapy and identified the functions below as its top functions. This is intended to give you an instant insight into the functionality scrapy implements, and to help you decide whether it suits your requirements.
            • Called when the response is ready
            • Return a list of values
            • Creates headers from a twisted response
            • Update the values in seq
            • Create a deprecated class
            • Return the path to the class
• Check whether a given class is a subclass of another class
            • Recursively follow requests
            • Parse a selector
            • Execute scrapy
            • Handle data received from the crawler
            • Create a subclass of ScrapyRequestQueue
            • Called when an item processor is dropped
            • Download robots txt file
            • Log download errors
            • Callback function for verifying SSL connection
            • Start the crawler
            • Follow given URLs
            • Return media to download
            • Return whether a cached response is fresh
            • Runs text tests
            • Follow a URL
            • Returns a list of request headers
            • Called when the request is downloaded
            • Call download function
            • Configure logging

            scrapy Key Features

            No Key Features are available at this moment for scrapy.

            scrapy Examples and Code Snippets

            Compile the documentation
Python · Lines of Code: 0 · License: Non-SPDX (NOASSERTION)
            make html  
Set up the environment
Python · Lines of Code: 0 · License: Non-SPDX (NOASSERTION)
            pip install -r requirements.txt  
            Recreating documentation on the fly
Python · Lines of Code: 0 · License: Non-SPDX (NOASSERTION)
            make watch  
How to extract specific text with no tag using Python Scrapy? (new problem)
Python · Lines of Code: 35 · License: Strong Copyleft (CC BY-SA 4.0)
            import scrapy
            from scrapy.crawler import CrawlerProcess
            class MmSpider(scrapy.Spider):
                name = 'name'
                start_urls = ['https://www.timeout.com/film/best-movies-of-all-time']
            
                def parse(self, response):
                    for title in respons
            Passing Variables to Scrapy
Python · Lines of Code: 14 · License: Strong Copyleft (CC BY-SA 4.0)
            scrapy crawl myspider -a parameter1=value1 -a parameter2=value2
            
            class MySpider(Spider):
                name = 'myspider'
                ...
                def parse(self, response):
                    ...
                    if self.parameter1 == value1:
                        # t
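
The snippet above is cut off inside the if block. Here is a minimal sketch of how arguments passed with -a can be used; note that they always arrive as strings, and the comparison value below is purely illustrative:

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        # Arguments passed with -a become instance attributes and always arrive as strings.
        if getattr(self, 'parameter1', None) == 'value1':   # 'value1' is illustrative
            self.logger.info('parameter1 matched; adjusting the parsing accordingly')
        yield {'url': response.url}
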
            Scrape endpoint with Basic authentication
Python · Lines of Code: 2 · License: Strong Copyleft (CC BY-SA 4.0)
            request.setRequestHeader("Authorization", "Basic "+btoa("apiInfoelectoral:apiInfoelectoralPro"));
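
The snippet above is JavaScript (setRequestHeader plus btoa). In a Scrapy spider, a roughly equivalent sketch sets the Authorization header on the request itself; the spider name and endpoint URL below are placeholders:

import base64

import scrapy


class ApiSpider(scrapy.Spider):
    name = 'api'  # placeholder

    def start_requests(self):
        # Same credentials as in the JavaScript snippet; the endpoint URL is a placeholder.
        token = base64.b64encode(b'apiInfoelectoral:apiInfoelectoralPro').decode()
        yield scrapy.Request(
            'https://example.com/api/endpoint',
            headers={'Authorization': f'Basic {token}'},
            callback=self.parse,
        )

    def parse(self, response):
        yield {'status': response.status}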
            
Reading names to search from a text file
            names_to_search = []
            
            def get_names_to_search():
                # open file to read
                file = open ("cegek.txt", "r")
                # read lines in file
                lines = file.readlines()
                # loop through file and append names to list
                for line in lines:
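
The snippet is cut off inside the loop. A minimal sketch of a likely completion follows; stripping whitespace and skipping blank lines are assumptions:

names_to_search = []

def get_names_to_search():
    # Open the file, collect one stripped name per non-empty line, and return the list.
    with open("cegek.txt", "r") as file:
        for line in file:
            name = line.strip()
            if name:
                names_to_search.append(name)
    return names_to_search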
                 
            Scrapy - ReactorAlreadyInstalledError when using TwistedScheduler
Python · Lines of Code: 49 · License: Strong Copyleft (CC BY-SA 4.0)
            from multiprocessing import Process
            
            from scrapy.crawler import CrawlerRunner
            from scrapy.utils.project import get_project_settings
            from scrapy.utils.log import configure_logging
from apscheduler.schedulers.blocking import BlockingScheduler
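
The snippet is cut off after the imports. A minimal sketch of the usual workaround is shown below: run each scheduled crawl in a separate process so a fresh Twisted reactor can be installed every time. It assumes the script lives inside a Scrapy project so get_project_settings() can find the spider; the spider name and interval are placeholders.

from multiprocessing import Process

from apscheduler.schedulers.blocking import BlockingScheduler
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings


def run_spider():
    # Imported here so the Twisted reactor is only installed inside the child process.
    from twisted.internet import reactor

    configure_logging()
    runner = CrawlerRunner(get_project_settings())
    deferred = runner.crawl("myspider")  # placeholder spider name
    deferred.addBoth(lambda _: reactor.stop())
    reactor.run()


def crawl_in_fresh_process():
    process = Process(target=run_spider)
    process.start()
    process.join()


if __name__ == "__main__":
    scheduler = BlockingScheduler()
    scheduler.add_job(crawl_in_fresh_process, "interval", minutes=30)  # placeholder interval
    scheduler.start()
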
            Scrapy - ReactorAlreadyInstalledError when using TwistedScheduler
Python · Lines of Code: 26 · License: Strong Copyleft (CC BY-SA 4.0)
            
            from scrapy.crawler import CrawlerRunner
            from scrapy.utils.project import get_project_settings
            from scrapy.utils.log import configure_logging
            from twisted.internet import reactor
            from apscheduler.schedulers.twisted import TwistedScheduler
How can I crawl all of a table's data with scrapy
Python · Lines of Code: 67 · License: Strong Copyleft (CC BY-SA 4.0)
            import scrapy
            from scrapy.crawler import CrawlerProcess
            
            class PostsSpider(scrapy.Spider):
                name = "posts"
            
                start_urls= ['https://publicholidays.com.bd/2022-dates']
                
                def p
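
Since the snippet is cut off, here is a minimal sketch of the table-scraping idea it appears to illustrate; the CSS selectors and the FEEDS output file are assumptions, not the page's real markup:

import scrapy
from scrapy.crawler import CrawlerProcess


class PostsSpider(scrapy.Spider):
    name = "posts"
    start_urls = ["https://publicholidays.com.bd/2022-dates"]

    def parse(self, response):
        # The selectors are assumptions; adjust them to the page's actual table markup.
        for row in response.css("table tbody tr"):
            cells = [c.strip() for c in row.css("td ::text").getall() if c.strip()]
            if cells:
                yield {"columns": cells}


if __name__ == "__main__":
    # FEEDS writes every yielded item to a CSV file (an illustrative choice).
    process = CrawlerProcess(settings={"FEEDS": {"table.csv": {"format": "csv"}}})
    process.crawl(PostsSpider)
    process.start()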

            Community Discussions

            QUESTION

How to correctly loop links with Scrapy?
            Asked 2022-Mar-03 at 09:22

I'm using Scrapy and I'm having some problems while looping through a link.

I'm scraping the majority of the information from one single page, except for one piece which points to another page.

            There are 10 articles on each page. For each article I have to get the abstract which is on a second page. The correspondence between articles and abstracts is 1:1.

Here is the div section I'm using to scrape the data:

            ...

            ANSWER

            Answered 2022-Mar-01 at 19:43

            The link to the article abstract appears to be a relative link (from the exception). /doi/abs/10.1080/03066150.2021.1956473 doesn't start with https:// or http://.

You should append this relative URL to the base URL of the website (i.e. if the base URL is "https://www.tandfonline.com", you can join the two with response.urljoin(), or simply use response.follow(), which resolves relative links for you).
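
A sketch of that pattern; the CSS selectors are assumptions:

def parse(self, response):
    # The selector is an assumption; the href values are relative, e.g. /doi/abs/...
    for href in response.css("a.abstract-link::attr(href)").getall():
        # response.urljoin(href) resolves the relative path against the page's base URL;
        # response.follow() performs the same join internally.
        yield response.follow(href, callback=self.parse_abstract)

def parse_abstract(self, response):
    yield {"abstract": " ".join(response.css("div.abstract ::text").getall())}  # hypothetical selector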

            Source https://stackoverflow.com/questions/71308962

            QUESTION

            Scrapy exclude URLs containing specific text
            Asked 2022-Feb-24 at 02:49

            I have a problem with a Scrapy Python program I'm trying to build. The code is the following.

            ...

            ANSWER

            Answered 2022-Feb-24 at 02:49

You have two issues with your code. First, you have two Rules in your crawl spider, and you have included the deny restriction in the second rule, which never gets checked because the first Rule follows all links and then calls the callback. Since the first rule is checked first, it does not exclude the URLs you don't want to crawl. The second issue is that in your second rule you have included the literal string of what you want to avoid scraping, but deny expects regular expressions.

The solution is to remove the first rule and slightly change the deny argument by escaping special regex characters in the URL, such as -. See the sample below.
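
A sketch of what that single rule could look like; the spider name, start URL, and denied path are placeholders:

import re

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = "myspider"                      # placeholder
    start_urls = ["https://example.com"]   # placeholder

    rules = (
        # One rule only: follow everything except URLs matching the deny pattern.
        # re.escape() handles special regex characters in the path, such as '-'.
        Rule(
            LinkExtractor(deny=(re.escape("/path-to-avoid/"),)),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        yield {"url": response.url}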

            Source https://stackoverflow.com/questions/71224474

            QUESTION

            Why does scrapy_splash CrawlSpider take the same amount of time as scrapy with Selenium?
            Asked 2022-Jan-22 at 16:39

            I have the following scrapy CrawlSpider:

            ...

            ANSWER

            Answered 2022-Jan-22 at 16:39

            Taking a stab at an answer here with no experience of the libraries.

It looks like Scrapy Crawlers themselves are single-threaded. To get multi-threaded behavior you need to configure your application differently or write code that makes it behave multi-threaded. It sounds like you've already tried this, so this is probably not news to you, but make sure you have configured the CONCURRENT_REQUESTS and REACTOR_THREADPOOL_MAXSIZE settings, as sketched below.

            https://docs.scrapy.org/en/latest/topics/settings.html?highlight=thread#reactor-threadpool-maxsize
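
For reference, these are ordinary Scrapy settings; the values below are illustrative, not recommendations:

# settings.py -- illustrative values, not recommendations
CONCURRENT_REQUESTS = 32           # how many requests Scrapy keeps in flight at once
REACTOR_THREADPOOL_MAXSIZE = 20    # size of the Twisted thread pool (DNS lookups, some blocking IO)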

I can't imagine there is much CPU work going on in the crawling process, so I doubt it's a GIL issue.

            Excluding GIL as an option there are two possibilities here:

1. Your crawler is not actually multi-threaded. This may be because you are missing some setup or configuration that would make it so, i.e. you may have set the env variables correctly, but your crawler is written in a way that processes requests for URLs synchronously instead of submitting them to a queue.

            To test this, create a global object and store a counter on it. Each time your crawler starts a request increment the counter. Each time your crawler finishes a request, decrement the counter. Then run a thread that prints the counter every second. If your counter value is always 1, then you are still running synchronously.

            Source https://stackoverflow.com/questions/70647245

            QUESTION

            How can I send Dynamic website content to scrapy with the html content generated by selenium browser?
            Asked 2022-Jan-20 at 15:35

I am working on certain stock-related projects where I have to scrape all data on a daily basis for the last 5 years, i.e. from 2016 to date. I thought of using Selenium in particular because I can use a crawler and bot to scrape the data based on the date. So I used button clicks with Selenium, and now I want the same data that is displayed by the Selenium browser to be fed to Scrapy. This is the website I am working on right now. I have written the following code inside the Scrapy spider.

            ...

            ANSWER

            Answered 2022-Jan-14 at 09:30

The 2 solutions are not very different. Solution #2 fits your question better, but choose whichever you prefer.

Solution 1 - create a response with the HTML body from the driver and scrape it right away (you can also pass it as an argument to a function):
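
A sketch of that approach; the URL and selector are placeholders, and a local Selenium Chrome driver is assumed:

from scrapy.http import HtmlResponse
from selenium import webdriver

driver = webdriver.Chrome()          # assumes a local Chrome/chromedriver setup
driver.get("https://example.com")    # placeholder URL
# ... click the date buttons with Selenium here ...

# Wrap the rendered page source in a Scrapy response so the usual selectors work.
response = HtmlResponse(
    url=driver.current_url,
    body=driver.page_source,
    encoding="utf-8",
)
rows = response.css("table tr").getall()  # hypothetical selector
driver.quit()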

            Source https://stackoverflow.com/questions/70651053

            QUESTION

            Scrapy display response.request.url inside zip()
            Asked 2021-Dec-22 at 07:59

            I'm trying to create a simple Scrapy function which will loop through a set of standard URLs and pull their Alexa Rank. The output I want is just two columns: One showing the scraped Alexa Rank, and one showing the URL which was scraped.

            Everything seems to be working except that I cannot get the scraped URL to display correctly in my output. My code currently is:

            ...

            ANSWER

            Answered 2021-Dec-22 at 07:59

Here zip() takes 'rank', which is a list, and 'url_raw', which is a string, so you get one character from 'url_raw' for each iteration.

            Solution with cycle:
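
The original sample is not reproduced here; a minimal sketch of the idea follows (the rank selector is an assumption):

from itertools import cycle

def parse(self, response):
    ranks = response.css("div.rank::text").getall()   # hypothetical selector for the rank
    url_raw = response.request.url
    # cycle() repeats the single URL string, so zip pairs the whole URL with every rank
    # instead of pairing one character per rank.
    for rank, url in zip(ranks, cycle([url_raw])):
        yield {"alexa_rank": rank, "url": url}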

            Source https://stackoverflow.com/questions/70440363

            QUESTION

            Yielding values from consecutive parallel parse functions via meta in Scrapy
            Asked 2021-Dec-20 at 07:53

In my scrapy code I'm trying to yield the following figures from parliament's website, where all the members of parliament (MPs) are listed. Opening the links for each MP, I'm making parallel requests to get the figures I'm trying to count. I intend to yield each of the three figures below together with the name and the party of the MP.

            Here are the figures I'm trying to scrape

1. How many bill proposals each MP has their signature on
2. How many question proposals each MP has their signature on
3. How many times each MP spoke in the parliament

In order to count and yield how many bills each member of parliament has their signature on, I'm trying to write a scraper for the members of parliament that works with 3 layers:

• Starting with the link where all MPs are listed
• From (1), accessing the individual page of each MP where the three pieces of information defined above are displayed
• 3a) Requesting the page with bill proposals and counting them with the len function; 3b) requesting the page with question proposals and counting them with the len function; 3c) requesting the page with speeches and counting them with the len function

What I want: I want to yield the inquiries of 3a, 3b, and 3c together with the name and the party of the MP in the same row.

• Problem 1) When I output to csv it only creates fields for speech count, name, and party. It doesn't show me the fields for bill proposals and question proposals.

• Problem 2) There are two empty values for each MP, which I guess correspond to the values I described above in Problem 1.

• Problem 3) What is a better way of restructuring my code to output the three values on the same line, rather than printing each MP three times, once for each value that I'm scraping?

            ...

            ANSWER

            Answered 2021-Dec-18 at 06:26

This is happening because you are yielding dicts instead of item objects, so the spider engine will not have a guide to the fields you want to output by default.

In order to make the csv output include the fields bill_prop_count and res_prop_count, you should make the following changes in your code:

            1 - Create a base item object with all desirable fields - you can create this in the items.py file of your scrapy project:
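
A sketch of such an item, using the two field names mentioned above plus illustrative fields for the name, party, and speech count:

# items.py -- field names follow the question; the class name is illustrative
import scrapy


class MpItem(scrapy.Item):
    name = scrapy.Field()
    party = scrapy.Field()
    bill_prop_count = scrapy.Field()
    res_prop_count = scrapy.Field()
    speech_count = scrapy.Field()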

            Source https://stackoverflow.com/questions/70399191

            QUESTION

Parsing a 'Load More' response with HTML content
            Asked 2021-Dec-12 at 09:10

I'm trying to scrape each piece of content in the Istanbul Governorate's announcement section located at the link below, which loads content with a 'Load More' button at the bottom of the page. From dev tools / Network, I checked the properties of the POST request sent and updated the headers accordingly. The response apparently is not JSON but HTML.

I would like to yield the parsed HTML responses, but when I crawl it, it just doesn't return anything and gets stuck on the first request forever. Thank you in advance.

Could you explain to me what's wrong with my code? I checked tens of questions here but couldn't resolve the issue. As I understand it, it just can't parse the response HTML, but I couldn't figure out why.

P.S.: I have been enthusiastically learning Python and scraping for 20 days. Forgive my ignorance.

            ...

            ANSWER

            Answered 2021-Dec-12 at 09:10
1. Remove Content-Length, and never include it in the headers. You should also remove the cookie and let scrapy handle it.

            2. Look at the request body and recreate it for every page:

            3. You need to know when to stop, in this case it's an empty page.

4. In the bilgi.xpath part you're getting the same line over and over because you forgot a dot at the beginning.

The complete working code is available at the source link below.
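
As an illustration of the four points above, here is a minimal sketch; the endpoint, form field, and selectors are placeholders, not the site's real ones:

import scrapy


class AnnouncementsSpider(scrapy.Spider):
    name = "announcements"  # placeholder

    def start_requests(self):
        yield self.page_request(1)

    def page_request(self, page):
        # The endpoint and form field are placeholders; copy the real request body from dev tools.
        return scrapy.FormRequest(
            "https://example.gov.tr/announcements/load",
            formdata={"page": str(page)},
            callback=self.parse_page,
            cb_kwargs={"page": page},
        )

    def parse_page(self, response, page):
        rows = response.xpath("//div[@class='announcement']")  # placeholder selector
        if not rows:
            return  # an empty page means there is nothing more to load
        for row in rows:
            # Note the leading dot: the XPath is relative to the current row.
            yield {"title": row.xpath(".//a/text()").get()}
        yield self.page_request(page + 1)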

            Source https://stackoverflow.com/questions/70319472

            QUESTION

Looping through multiple URLs to scrape from a CSV file in Scrapy is not working
            Asked 2021-Dec-01 at 18:53

When I try to execute this loop I get an error. Please help: I want to scrape multiple links using a CSV file, but it gets stuck in start_urls. I am using Scrapy 2.5 and Python 3.9.7.

            ...

            ANSWER

            Answered 2021-Nov-09 at 17:07

            The error you received is rather straightforward; a numpy array doesn't have a to_list method.

            Instead you should simply iterate over the numpy array:
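
The answer's code is not reproduced here; a minimal sketch of the idea follows (the CSV filename and column name are assumptions):

import pandas as pd
import scrapy


class CsvSpider(scrapy.Spider):
    name = "csv_spider"  # placeholder

    def start_requests(self):
        # The CSV filename and column name are assumptions.
        urls = pd.read_csv("urls.csv")["url"].to_numpy()
        for url in urls:  # iterate the numpy array directly; no .to_list() is needed
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}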

            Source https://stackoverflow.com/questions/69902187

            QUESTION

During recursive scraping in scrapy, how do I extract info from multiple nodes of a parent URL and the associated child URLs together?
            Asked 2021-Nov-22 at 13:09

The parent URL has multiple nodes (quotes), and each parent node has a child URL (author info). I am having trouble linking the quote to the author info, possibly due to the asynchronous nature of scrapy.

How can I fix this issue? Here's the code so far. I added # <--- comments for easy spotting.

            ...

            ANSWER

            Answered 2021-Nov-22 at 13:09

Here is a minimal working solution. Both types of pagination work, and I use the meta keyword to transfer the quote item from one response to another.
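
The full solution is at the source link; a minimal sketch of the meta hand-off is shown below. The selectors follow the quotes.toscrape.com layout, which the question appears to be based on:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            item = {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
            author_href = quote.css("a::attr(href)").get()
            # Hand the partially filled item to the author page via meta.
            yield response.follow(author_href, callback=self.parse_author, meta={"item": item})

    def parse_author(self, response):
        item = response.meta["item"]
        item["birthdate"] = response.css("span.author-born-date::text").get()
        yield item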

            Source https://stackoverflow.com/questions/70062567

            QUESTION

            How to set class variable through __init__ in Python?
            Asked 2021-Nov-08 at 20:06

I am trying to change a setting from the command line while starting a scrapy crawler (Python 3.7). Therefore I am adding an __init__ method, but I could not figure out how to change the class variable "delay" from within the __init__ method.

Minimal example:

            ...

            ANSWER

            Answered 2021-Nov-08 at 20:06

I am trying to change a setting from the command line while starting a scrapy crawler (Python 3.7). Therefore I am adding an __init__ method...
            ...
            scrapy crawl test -a delay=5

1. According to the scrapy docs (Settings / Command line options section), it is required to use the -s parameter to update a setting:
              scrapy crawl test -s DOWNLOAD_DELAY=5

2. It is not possible to update settings at runtime in spider code from __init__ or other methods (details in the related discussion on GitHub: Update spider settings during runtime #4196).
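
If the underlying goal is simply a per-spider delay, a supported in-code alternative is the custom_settings class attribute, which Scrapy reads before the crawl starts; a minimal sketch:

import scrapy


class TestSpider(scrapy.Spider):
    name = "test"
    # custom_settings is read when the crawler is created, before the spider runs,
    # so it is the supported place to set a per-spider value such as DOWNLOAD_DELAY.
    custom_settings = {
        "DOWNLOAD_DELAY": 5,
    }

    def parse(self, response):
        yield {"url": response.url}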

            Source https://stackoverflow.com/questions/69882916

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install scrapy

You can install scrapy using 'pip install scrapy' or download it from GitHub or PyPI.
You can use scrapy like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changes to the system.
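
Once installed, a minimal self-contained spider can be run directly with CrawlerProcess; the URL below is a placeholder:

import scrapy
from scrapy.crawler import CrawlerProcess


class TitleSpider(scrapy.Spider):
    name = "title"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}


if __name__ == "__main__":
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(TitleSpider)
    process.start()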

            Support

For any new features, suggestions, or bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.
            Install
          • PyPI

            pip install Scrapy

          • CLONE
          • HTTPS

            https://github.com/scrapy/scrapy.git

          • CLI

            gh repo clone scrapy/scrapy

          • sshUrl

            git@github.com:scrapy/scrapy.git
