scrapy | fast high-level web crawling | Crawler library

by scrapy | Python | Version: 2.8.0 | License: BSD-3-Clause

kandi X-RAY | scrapy Summary

scrapy is a Python library typically used in Automation, Crawler, and Selenium applications. scrapy has a build file available, a Permissive License, and high support. However, scrapy has 47 bugs and 6 vulnerabilities. You can install it with 'pip install scrapy' or download it from GitHub or PyPI.
Scrapy, a fast high-level web crawling & scraping framework for Python.

Support

• scrapy has a highly active ecosystem.
• It has 46510 star(s) with 9897 fork(s). There are 1782 watchers for this library.
• There were 4 major release(s) in the last 6 months.
• There are 483 open issues and 2292 have been closed. On average, issues are closed in 760 days. There are 253 open pull requests and 0 closed requests.
• It has a positive sentiment in the developer community.
• The latest version of scrapy is 2.8.0.

Quality

• scrapy has 47 bugs (1 blocker, 3 critical, 31 major, 12 minor) and 588 code smells.

Security

• scrapy has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
• scrapy code analysis shows 6 unresolved vulnerabilities (1 blocker, 0 critical, 5 major, 0 minor).
• There are 1523 security hotspots that need review.

License

• scrapy is licensed under the BSD-3-Clause License. This license is Permissive.
• Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

• scrapy releases are available to install and integrate.
• A deployable package is available on PyPI.
• A build file is available, so you can build the component from source.
• scrapy saves you 17074 person hours of effort in developing the same functionality from scratch.
• It has 33894 lines of code, 3646 functions and 333 files.
• It has low code complexity. Code complexity directly impacts maintainability of the code.
Top functions reviewed by kandi - BETA

kandi has reviewed scrapy and discovered the below as its top functions. This is intended to give you an instant insight into the functionality scrapy implements, and to help you decide whether it suits your requirements.
• Called when the response is ready
• Return a list of values
• Creates headers from a twisted response
• Update the values in seq
• Create a deprecated class
• Return the path to the class
• Check if the given subclass is a subclass of the subclass
• Recursively follow requests
• Parse a selector
• Execute scrapy
• Handle data received from the crawler
• Create a subclass of ScrapyRequestQueue
• Called when an item processor is dropped
• Download robots txt file
• Log download errors
• Callback function for verifying SSL connection
• Start the crawler
• Follow given URLs
• Return media to download
• Return whether a cached response is fresh
• Runs text tests
• Follow a URL
• Returns a list of request headers
• Called when the request is downloaded
• Call download function
• Configure logging

Get all kandi verified functions for this library.

                                                                                  scrapy Key Features

                                                                                  Scrapy, a fast high-level web crawling & scraping framework for Python.
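For orientation, the following is a minimal, hypothetical spider sketch; the spider name, start URL, and CSS selectors are illustrative assumptions, not taken from this summary:

import scrapy


class QuotesSpider(scrapy.Spider):
    # Hypothetical example spider; name, URL and selectors are assumptions.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if there is one.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

It can be run with "scrapy runspider quotes_spider.py -o quotes.json" (the file name is an assumption).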

                                                                                  scrapy Examples and Code Snippets

                                                                                  Compile the documentation
Python | Lines of Code: 0 | License: Non-SPDX (NOASSERTION)

make html
                                                                                  Setup the environment
Python | Lines of Code: 0 | License: Non-SPDX (NOASSERTION)

pip install -r requirements.txt
                                                                                  Recreating documentation on the fly
Python | Lines of Code: 0 | License: Non-SPDX (NOASSERTION)

make watch
                                                                                  Passing Variables to Scrapy
Python | Lines of Code: 14 | License: Strong Copyleft (CC BY-SA 4.0)

scrapy crawl myspider -a parameter1=value1 -a parameter2=value2

from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    ...

    def parse(self, response):
        ...
        # Arguments passed with -a become string attributes on the spider.
        if self.parameter1 == 'value1':
            # this is True
            ...

        # or also
        if getattr(self, 'parameter2') == 'value2':
            # this is also True
            ...
                                                                                  
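Spider arguments can also be captured explicitly in __init__; a minimal sketch under the same assumptions (the argument names and default value are illustrative):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, parameter1=None, parameter2='default', *args, **kwargs):
        # Forward remaining keyword arguments so Scrapy can finish its own setup.
        super().__init__(*args, **kwargs)
        self.parameter1 = parameter1
        self.parameter2 = parameter2
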
                                                                                  Scrape endpoint with Basic authentication
JavaScript (browser) | Lines of Code: 2 | License: Strong Copyleft (CC BY-SA 4.0)

request.setRequestHeader("Authorization", "Basic " + btoa("apiInfoelectoral:apiInfoelectoralPro"));
                                                                                  
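The same Basic authentication can be done from Scrapy itself by setting the Authorization header on the request; a minimal sketch, reusing the (hypothetical) credentials above and a placeholder endpoint:

import base64

import scrapy

class BasicAuthSpider(scrapy.Spider):
    name = 'basic_auth_example'  # hypothetical spider name

    def start_requests(self):
        # Build the "Basic <base64(user:password)>" header value by hand.
        token = base64.b64encode(b"apiInfoelectoral:apiInfoelectoralPro").decode("ascii")
        yield scrapy.Request(
            "https://example.com/api/endpoint",  # placeholder URL
            headers={"Authorization": f"Basic {token}"},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Status: %s", response.status)
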
Search for names read from a text file (Selenium)

names_to_search = []

def get_names_to_search():
    # open the file and read its lines
    with open("cegek.txt", "r") as file:
        lines = file.readlines()
    # loop through the lines and append the stripped names to the list
    for line in lines:
        names_to_search.append(line.strip())

get_names_to_search()

# The names_to_search list will contain:

['SZIMIKRON Ipari Kft.', 'Tigra Computer- és Irodatechnikai Kft.', 'Tradeland Kft.', 'Török László EV Török Kulcsszervíz', 'Tungsram Operations Kft.', 'Tutti Élelmiszeripari Kft.', 'Water and Soil Kft.', 'Webkey Development Kft.', 'ZDMnet']

# 'driver' is assumed to be an already-initialised Selenium WebDriver instance.
for name in names_to_search:
    driver.find_element_by_xpath("//input[@type='search']").send_keys(name)
                                                                                  
                                                                                  Scrapy - ReactorAlreadyInstalledError when using TwistedScheduler
Python | Lines of Code: 26 | License: Strong Copyleft (CC BY-SA 4.0)
                                                                                  
                                                                                  from scrapy.crawler import CrawlerRunner
                                                                                  from scrapy.utils.project import get_project_settings
                                                                                  from scrapy.utils.log import configure_logging
                                                                                  from twisted.internet import reactor
                                                                                  from apscheduler.schedulers.twisted import TwistedScheduler
                                                                                  
                                                                                  from myprojectscraper.spiders.my_homepage_spider import MyHomepageSpider
                                                                                  from myprojectscraper.spiders.my_spider import MySpider
                                                                                  
                                                                                  configure_logging()
                                                                                  
                                                                                  runner = CrawlerRunner(get_project_settings())
                                                                                  scheduler = TwistedScheduler(timezone="Europe/Amsterdam")
# Cron job: runs the 'homepage' spider every 4 hours (e.g. 12:10, 16:10, 20:10, etc.)
                                                                                  scheduler.add_job(runner.crawl, 'cron', args=[MyHomepageSpider], hour='*/4', minute=10)
# Cron job: runs the full spider every Monday, Thursday and Saturday at 04:35
                                                                                  scheduler.add_job(runner.crawl, 'cron', args=[MySpider], day_of_week='mon,thu,sat', hour=4, minute=35)
                                                                                  
                                                                                  deferred = runner.join()
                                                                                  deferred.addBoth(lambda _: reactor.stop())
                                                                                  
                                                                                  scheduler.start()
                                                                                  reactor.run()  # the script will block here until all crawling jobs are finished
                                                                                  scheduler.shutdown()
                                                                                  
How to extract specific text with no tag using Python Scrapy? (new problem)
Python | Lines of Code: 35 | License: Strong Copyleft (CC BY-SA 4.0)
                                                                                  import scrapy
                                                                                  from scrapy.crawler import CrawlerProcess
                                                                                  class MmSpider(scrapy.Spider):
                                                                                      name = 'name'
                                                                                      start_urls = ['https://www.timeout.com/film/best-movies-of-all-time']
                                                                                  
                                                                                      def parse(self, response):
                                                                                          for title in response.xpath('//h3[@class="_h3_cuogz_1"]'):
                                                                                              yield {
                                                                                                  'title':title.xpath('.//text()').getall()[-1].replace('\xa0','')
                                                                                              }
                                                                                  
                                                                                  if __name__ == "__main__":
                                                                                      process = CrawlerProcess()
                                                                                      process.crawl(MmSpider)
                                                                                      process.start()
                                                                                  
                                                                                  {'title': '2001: A Space Odyssey (1968)'}
                                                                                  2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
                                                                                  {'title': 'The Godfather (1972)'}
                                                                                  2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
                                                                                  {'title': 'Citizen Kane (1941)'}
                                                                                  2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
                                                                                  {'title': 'Jeanne Dielman, 23, Quai du Commerce, 1080 Bruxelles (1975)'}
                                                                                  2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
                                                                                  {'title': 'Raiders of the Lost Ark (1981)'}
                                                                                  2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
                                                                                  {'title': 'La Dolce Vita (1960)'}
                                                                                  2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
                                                                                  {'title': 'Seven Samurai (1954)'}
                                                                                  2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
                                                                                  {'title': 'In the Mood for Love (2000)'}
                                                                                  2022-04-12 15:17:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.timeout.com/film/best-movies-of-all-time>
                                                                                  {'title': 'There Will Be Blood (2007)'}
                                                                                  
                                                                                  Scrapy - ReactorAlreadyInstalledError when using TwistedScheduler
Python | Lines of Code: 49 | License: Strong Copyleft (CC BY-SA 4.0)
                                                                                  from multiprocessing import Process
                                                                                  
                                                                                  from scrapy.crawler import CrawlerRunner
                                                                                  from scrapy.utils.project import get_project_settings
                                                                                  from scrapy.utils.log import configure_logging
                                                                                  from apscheduler.schedulers.blocking import BlockingScheduler
                                                                                  
                                                                                  from myprojectscraper.spiders.my_homepage_spider import MyHomepageSpider
                                                                                  from myprojectscraper.spiders.my_spider import MySpider
                                                                                  
                                                                                  from twisted.internet import reactor
                                                                                  
                                                                                  # Create Process around the CrawlerRunner
                                                                                  class CrawlerRunnerProcess(Process):
                                                                                      def __init__(self, spider):
                                                                                          Process.__init__(self)
                                                                                          self.runner = CrawlerRunner(get_project_settings())
                                                                                          self.spider = spider
                                                                                  
                                                                                      def run(self):
                                                                                          deferred = self.runner.crawl(self.spider)
                                                                                          deferred.addBoth(lambda _: reactor.stop())
                                                                                          reactor.run(installSignalHandlers=False)
                                                                                  
                                                                                  # The wrapper to make it run multiple spiders, multiple times
                                                                                  def run_spider(spider):
                                                                                      crawler = CrawlerRunnerProcess(spider)
                                                                                      crawler.start()
                                                                                      crawler.join()
                                                                                  
                                                                                  # Enable logging when using CrawlerRunner
                                                                                  configure_logging()
                                                                                  
                                                                                  # Start the crawler in a scheduler
                                                                                  scheduler = BlockingScheduler(timezone="Europe/Amsterdam")
# Cron job: runs the 'homepage' spider every 4 hours (e.g. 12:10, 16:10, 20:10, etc.)
                                                                                  scheduler.add_job(run_spider, 'cron', args=[MyHomepageSpider], hour='*/4', minute=10)
# Cron job: runs the full spider every Monday, Thursday and Saturday at 04:35
                                                                                  scheduler.add_job(run_spider, 'cron', args=[MySpider], day_of_week='mon,thu,sat', hour=4, minute=35)
                                                                                  scheduler.start()
                                                                                  
                                                                                  2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Adding job tentatively -- it will be properly scheduled when the scheduler starts
                                                                                  2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Adding job tentatively -- it will be properly scheduled when the scheduler starts
                                                                                  2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Added job "run_spider" to job store "default"
                                                                                  2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Added job "run_spider" to job store "default"
                                                                                  2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Scheduler started
                                                                                  2022-03-31 22:50:24 [apscheduler.scheduler] DEBUG: Looking for jobs to run
                                                                                  2022-03-31 22:50:24 [apscheduler.scheduler] DEBUG: Next wakeup is due at 2022-04-01 00:10:00+02:00 (in 4775.280995 seconds)
                                                                                  
How can I crawl all of a table's data with Scrapy?
Python | Lines of Code: 67 | License: Strong Copyleft (CC BY-SA 4.0)
                                                                                  import scrapy
                                                                                  from scrapy.crawler import CrawlerProcess
                                                                                  
                                                                                  class PostsSpider(scrapy.Spider):
                                                                                      name = "posts"
                                                                                  
                                                                                      start_urls= ['https://publicholidays.com.bd/2022-dates']
                                                                                      
                                                                                      def parse(self, response):
                                                                                          for post in response.css('.publicholidays tbody tr'):
                                                                                              yield{
                                                                                                  'date' : post.css('td:nth-child(1)::text').get(),
                                                                                                  'day' : post.css('td:nth-child(2)::text' ).get(),
                                                                                                  'event' : post.css('td:nth-child(3) a::text').get() or post.css('td:nth-child(3) span::text').get()
                                                                                              }
                                                                                  if __name__ == "__main__":
                                                                                      process = CrawlerProcess()
                                                                                      process.crawl(PostsSpider)
                                                                                      process.start()
                                                                                  
                                                                                  {'date': '21 Feb', 'day': 'Mon', 'event': 'Shaheed Day'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '17 Mar', 'day': 'Thu', 'event': "Sheikh Mujibur Rahman's Birthday"}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '18 Mar', 'day': 'Fri', 'event': 'Shab e-Barat'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '26 Mar', 'day': 'Sat', 'event': 'Independence Day'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '14 Apr', 'day': 'Thu', 'event': 'Bengali New Year'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '28 Apr', 'day': 'Thu', 'event': 'Laylat al-Qadr'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '29 Apr', 'day': 'Fri', 'event': 'Jumatul Bidah'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '1 May', 'day': 'Sun', 'event': 'May Day'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '2 May', 'day': 'Mon', 'event': 'Eid ul-Fitr Holiday'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '3 May', 'day': 'Tue', 'event': 'Eid ul-Fitr'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '4 May', 'day': 'Wed', 'event': 'Eid ul-Fitr Holiday'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '16 May', 'day': 'Mon', 'event': 'Buddha Purnima'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '9 Jul', 'day': 'Sat', 'event': 'Eid ul-Adha Holiday'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '\n', 'day': None, 'event': None}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '10 Jul', 'day': 'Sun', 'event': 'Eid ul-Adha'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '11 Jul', 'day': 'Mon', 'event': 'Eid ul-Adha Holiday'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '9 Aug', 'day': 'Tue', 'event': 'Ashura'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '15 Aug', 'day': 'Mon', 'event': 'National Mourning Day'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '19 Aug', 'day': 'Fri', 'event': 'Shuba Janmashtami'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '5 Oct', 'day': 'Wed', 'event': 'Vijaya Dashami'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '9 Oct', 'day': 'Sun', 'event': 'Eid-e-Milad un-Nabi'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  {'date': '16 Dec', 'day': 'Fri', 'event': 'Victory Day'}
                                                                                  2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
                                                                                  
                                                                                  Community Discussions

                                                                                  Trending Discussions on scrapy

• How to correctly loop links with Scrapy?
• Scrapy exclude URLs containing specific text
• Why does scrapy_splash CrawlSpider take the same amount of time as scrapy with Selenium?
• How can I send dynamic website content to Scrapy with the HTML content generated by a Selenium browser?
• Scrapy display response.request.url inside zip()
• Yielding values from consecutive parallel parse functions via meta in Scrapy
• Parsing a 'Load More' response with HTML content
• Loop through multiple URLs to scrape from a CSV file in Scrapy is not working
• During recursive scraping in Scrapy, how to extract info from multiple nodes of a parent URL and its associated child URLs together?
• How to set a class variable through __init__ in Python?

                                                                                  QUESTION

How to correctly loop links with Scrapy?
                                                                                  Asked 2022-Mar-03 at 09:22

I'm using Scrapy and I'm having some problems while looping through links.

I'm scraping the majority of the information from one single page, except for one item which points to another page.

                                                                                  There are 10 articles on each page. For each article I have to get the abstract which is on a second page. The correspondence between articles and abstracts is 1:1.

Here is the div section I'm using to scrape the data:

To do so, I have defined the following script:

                                                                                  import scrapy
                                                                                  import pandas as pd
                                                                                  
                                                                                  
                                                                                  class QuotesSpider(scrapy.Spider):
                                                                                      name = "jps"
                                                                                  
                                                                                      start_urls = ['https://www.tandfonline.com/toc/fjps20/current']
                                                                                      
                                                                                  
                                                                                      def parse(self, response):
                                                                                          self.logger.info('hello this is my first spider')
                                                                                          Title = response.xpath("//span[@class='hlFld-Title']").extract()
                                                                                          Authors = response.xpath("//span[@class='articleEntryAuthorsLinks']").extract()
                                                                                          License = response.xpath("//span[@class='part-tooltip']").extract()
                                                                                          abstract_url = response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract()
                                                                                          row_data = zip(Title, Authors, License, abstract_url)
                                                                                          
                                                                                          for quote in row_data:
                                                                                              scraped_info = {
                                                                                                  # key:value
                                                                                                  'Title': quote[0],
                                                                                                  'Authors': quote[1],
                                                                                                  'License': quote[2],
                                                                                                  'Abstract': quote[3]
                                                                                              }
                                                                                              # yield/give the scraped info to scrapy
                                                                                              yield scraped_info
                                                                                      
                                                                                      
                                                                                      def parse_links(self, response):
                                                                                          
                                                                                          for links in response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract():
                                                                                              yield scrapy.Request(links, callback=self.parse_abstract_page)
                                                                                          #yield response.follow(abstract_url, callback=self.parse_abstract_page)
                                                                                      
                                                                                      def parse_abstract_page(self, response):
                                                                                          Abstract = response.xpath("//div[@class='hlFld-Abstract']").extract_first()
                                                                                          row_data = zip(Abstract)
                                                                                          for quote in row_data:
                                                                                              scraped_info_abstract = {
                                                                                                  # key:value
                                                                                                  'Abstract': quote[0]
                                                                                              }
                                                                                              # yield/give the scraped info to scrapy
                                                                                              yield scraped_info_abstract
                                                                                          
                                                                                  

Authors, title and license are scraped correctly. For the abstract, I'm getting the following error:

                                                                                  ValueError: Missing scheme in request url: /doi/abs/10.1080/03066150.2021.1956473
                                                                                  

To check if the path was correct, I removed abstract_url from the loop:

abstract_url = response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract_first()
self.logger.info('get abstract page url')
yield response.follow(abstract_url, callback=self.parse_abstract)
                                                                                  

                                                                                  I can correctly reach the abstract corresponding to the first article, but not the others. I think the error is in the loop.

                                                                                  How can I solve this issue?

                                                                                  Thanks

                                                                                  ANSWER

                                                                                  Answered 2022-Mar-01 at 19:43

                                                                                  The link to the article abstract appears to be a relative link (from the exception). /doi/abs/10.1080/03066150.2021.1956473 doesn't start with https:// or http://.

You should join this relative URL with the base URL of the website (i.e. if the base URL is "https://www.tandfonline.com", you can do the following):

                                                                                  import urllib.parse
                                                                                  
                                                                                  link = urllib.parse.urljoin("https://www.tandfonline.com", link)
                                                                                  

                                                                                  Then you'll have a proper URL to the resource.
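Alternatively, you can let Scrapy resolve relative URLs for you with response.follow; a minimal sketch based on the question's parse method (the selector and callback name are taken from the question):

def parse(self, response):
    for link in response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract():
        # response.follow() accepts a relative URL and joins it with the page URL.
        yield response.follow(link, callback=self.parse_abstract_page)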

                                                                                  Source https://stackoverflow.com/questions/71308962

                                                                                  QUESTION

                                                                                  Scrapy exclude URLs containing specific text
                                                                                  Asked 2022-Feb-24 at 02:49

I have a problem with a Scrapy Python program I'm trying to build. The code is as follows.

                                                                                  import scrapy
                                                                                  from scrapy.spiders import CrawlSpider, Rule
                                                                                  from scrapy.linkextractors import LinkExtractor
                                                                                  
                                                                                  class LinkscrawlItem(scrapy.Item):
                                                                                      link = scrapy.Field()
                                                                                      attr = scrapy.Field()
                                                                                  
                                                                                  class someSpider(CrawlSpider):
                                                                                    name = 'mysitecrawler'
                                                                                    item = []
                                                                                  
                                                                                    allowed_domains = ['mysite.co.uk']
                                                                                    start_urls = ['https://mysite.co.uk/']
                                                                                  
                                                                                    rules = (Rule (LinkExtractor(), callback="parse_obj", follow=True),
                                                                                      Rule (LinkExtractor(deny=('my-account', 'cart', 'checkout', 'wp-content')))
                                                                                    )
                                                                                  
                                                                                    def parse_obj(self,response):
                                                                                      item = LinkscrawlItem()
                                                                                      item["link"] = str(response.url)+":"+str(response.status)
                                                                                      filename = 'links2.txt'
                                                                                      with open(filename, 'a') as f:
                                                                                        f.write('\n'+str(response.url)+":"+str(response.status)+'\n')
                                                                                      self.log('Saved file %s' % filename)
                                                                                  

I'm having trouble with the LinkExtractor: as I understand it, deny should exclude the listed links from the crawl, but they are still being crawled. For the first three entries the URLs are:

                                                                                  https://mysite.co.uk/my-account/

                                                                                  https://mysite.co.uk/cart/

                                                                                  https://mysite.co.uk/checkout/

The last entry matches URLs containing wp-content, for example:

                                                                                  https://mysite.co.uk/wp-content/uploads/01/22/photo.jpg

                                                                                  Would anyone know what I'm doing wrong with my deny list please?

                                                                                  Thank you

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-24 at 02:49

You have two issues with your code. First, you have two Rules in your crawl spider, and the deny restriction sits on the second rule, which never gets a chance to apply: the first Rule already extracts and follows every link and calls the callback, so the URLs you want to exclude are still crawled. Second, deny expects regular expressions rather than plain strings (a literal string still works as a pattern, but it is treated as a regex, so special characters matter).

The solution is to remove the first rule and keep the deny patterns on the single remaining rule. The sample below escapes the hyphen (my\-account) using raw strings; a plain 'my-account' would also match, since - is only special inside a character class.

                                                                                  import scrapy
                                                                                  from scrapy.spiders import CrawlSpider, Rule
                                                                                  from scrapy.linkextractors import LinkExtractor
                                                                                  
                                                                                  class LinkscrawlItem(scrapy.Item):
                                                                                      link = scrapy.Field()
                                                                                      attr = scrapy.Field()
                                                                                  
                                                                                  class SomeSpider(CrawlSpider):
                                                                                      name = 'mysitecrawler'
                                                                                      allowed_domains = ['mysite.co.uk']
                                                                                      start_urls = ['https://mysite.co.uk/']
                                                                                  
                                                                                      rules = (
Rule (LinkExtractor(deny=(r'my\-account', 'cart', 'checkout', r'wp\-content')), callback="parse_obj", follow=True),
                                                                                      )
                                                                                  
                                                                                      def parse_obj(self,response):
                                                                                          item = LinkscrawlItem()
                                                                                          item["link"] = str(response.url)+":"+str(response.status)
                                                                                          filename = 'links2.txt'
                                                                                          with open(filename, 'a') as f:
                                                                                              f.write('\n'+str(response.url)+":"+str(response.status)+'\n')
                                                                                          self.log('Saved file %s' % filename)
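
If in doubt about which links a rule will pick up, the extractor can be checked interactively before running the spider. A rough sketch using scrapy shell (the deny patterns mirror the rule above; the URL is your own start page):

$ scrapy shell "https://mysite.co.uk/"
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(deny=(r'my\-account', 'cart', 'checkout', r'wp\-content'))
>>> [link.url for link in le.extract_links(response)]   # denied URLs should not show up here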
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/71224474

                                                                                  QUESTION

                                                                                  Why does scrapy_splash CrawlSpider take the same amount of time as scrapy with Selenium?
                                                                                  Asked 2022-Jan-22 at 16:39

                                                                                  I have the following scrapy CrawlSpider:

                                                                                  import logger as lg
                                                                                  from scrapy.crawler import CrawlerProcess
                                                                                  from scrapy.http import Response
                                                                                  from scrapy.spiders import CrawlSpider, Rule
                                                                                  from scrapy_splash import SplashTextResponse
                                                                                  from urllib.parse import urlencode
                                                                                  from scrapy.linkextractors import LinkExtractor
                                                                                  from scrapy.http import HtmlResponse
                                                                                  
                                                                                  logger = lg.get_logger("oddsportal_spider")
                                                                                  
                                                                                  
                                                                                  class SeleniumScraper(CrawlSpider):
                                                                                      
                                                                                      name = "splash"
                                                                                      
                                                                                      custom_settings = {
                                                                                          "USER_AGENT": "*",
                                                                                          "LOG_LEVEL": "WARNING",
                                                                                          "DOWNLOADER_MIDDLEWARES": {
                                                                                              'scraper_scrapy.odds.middlewares.SeleniumMiddleware': 543,
                                                                                          },
                                                                                      }
                                                                                  
                                                                                      httperror_allowed_codes = [301]
                                                                                      
                                                                                      start_urls = ["https://www.oddsportal.com/tennis/results/"]
                                                                                      
                                                                                      rules = (
                                                                                          Rule(
                                                                                              LinkExtractor(allow="/atp-buenos-aires/results/"),
                                                                                              callback="parse_tournament",
                                                                                              follow=True,
                                                                                          ),
                                                                                          Rule(
                                                                                              LinkExtractor(
                                                                                                  allow="/tennis/",
                                                                                                  restrict_xpaths=("//td[@class='name table-participant']//a"),
                                                                                              ),
                                                                                              callback="parse_match",
                                                                                          ),
                                                                                      )
                                                                                  
                                                                                      def parse_tournament(self, response: Response):
                                                                                          logger.info(f"Parsing tournament - {response.url}")
                                                                                      
                                                                                      def parse_match(self, response: Response):
                                                                                          logger.info(f"Parsing match - {response.url}")
                                                                                  
                                                                                  
                                                                                  process = CrawlerProcess()
                                                                                  process.crawl(SeleniumScraper)
                                                                                  process.start()
                                                                                  

                                                                                  The Selenium middleware is as follows:

                                                                                  class SeleniumMiddleware:
                                                                                  
                                                                                      @classmethod
                                                                                      def from_crawler(cls, crawler):
                                                                                          middleware = cls()
                                                                                          crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
                                                                                          crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
                                                                                          return middleware
                                                                                  
                                                                                      def process_request(self, request, spider):
                                                                                          logger.debug(f"Selenium processing request - {request.url}")
                                                                                          self.driver.get(request.url)
                                                                                          return HtmlResponse(
                                                                                              request.url,
                                                                                              body=self.driver.page_source,
                                                                                              encoding='utf-8',
                                                                                              request=request,
                                                                                          )
                                                                                  
                                                                                      def spider_opened(self, spider):
                                                                                          options = webdriver.FirefoxOptions()
                                                                                          options.add_argument("--headless")
                                                                                          self.driver = webdriver.Firefox(
                                                                                              options=options,
                                                                                              executable_path=Path("/opt/geckodriver/geckodriver"),
                                                                                          )
                                                                                  
                                                                                      def spider_closed(self, spider):
                                                                                          self.driver.close()
                                                                                  

End to end this takes around a minute for roughly 50 pages. To try to speed things up and take advantage of multiple threads and JavaScript, I've implemented the following scrapy_splash spider:

                                                                                  class SplashScraper(CrawlSpider):
                                                                                      
                                                                                      name = "splash"
                                                                                      
                                                                                      custom_settings = {
                                                                                          "USER_AGENT": "*",
                                                                                          "LOG_LEVEL": "WARNING",
                                                                                          "SPLASH_URL": "http://localhost:8050",
                                                                                          "DOWNLOADER_MIDDLEWARES": {
                                                                                              'scrapy_splash.SplashCookiesMiddleware': 723,
                                                                                              'scrapy_splash.SplashMiddleware': 725,
                                                                                              'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
                                                                                          },
                                                                                          "SPIDER_MIDDLEWARES": {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100},
                                                                                          "DUPEFILTER_CLASS": 'scrapy_splash.SplashAwareDupeFilter',
                                                                                          "HTTPCACHE_STORAGE": 'scrapy_splash.SplashAwareFSCacheStorage',
                                                                                      }
                                                                                  
                                                                                      httperror_allowed_codes = [301]
                                                                                      
                                                                                      start_urls = ["https://www.oddsportal.com/tennis/results/"]
                                                                                      
                                                                                      rules = (
                                                                                          Rule(
                                                                                              LinkExtractor(allow="/atp-buenos-aires/results/"),
                                                                                              callback="parse_tournament",
                                                                                              process_request="use_splash",
                                                                                              follow=True,
                                                                                          ),
                                                                                          Rule(
                                                                                              LinkExtractor(
                                                                                                  allow="/tennis/",
                                                                                                  restrict_xpaths=("//td[@class='name table-participant']//a"),
                                                                                              ),
                                                                                              callback="parse_match",
                                                                                              process_request="use_splash",
                                                                                          ),
                                                                                      )
                                                                                  
                                                                                      def process_links(self, links): 
                                                                                          for link in links: 
                                                                                              link.url = "http://localhost:8050/render.html?" + urlencode({'url' : link.url}) 
                                                                                          return links
                                                                                  
                                                                                      def _requests_to_follow(self, response):
                                                                                          if not isinstance(response, (HtmlResponse, SplashTextResponse)):
                                                                                              return
                                                                                          seen = set()
                                                                                          for rule_index, rule in enumerate(self._rules):
                                                                                              links = [lnk for lnk in rule.link_extractor.extract_links(response)
                                                                                                       if lnk not in seen]
                                                                                              for link in rule.process_links(links):
                                                                                                  seen.add(link)
                                                                                                  request = self._build_request(rule_index, link)
                                                                                                  yield rule.process_request(request, response)
                                                                                  
                                                                                      def use_splash(self, request, response):
                                                                                          request.meta.update(splash={'endpoint': 'render.html'})
                                                                                          return request
                                                                                  
                                                                                      def parse_tournament(self, response: Response):
                                                                                          logger.info(f"Parsing tournament - {response.url}")
                                                                                      
                                                                                      def parse_match(self, response: Response):
                                                                                          logger.info(f"Parsing match - {response.url}")
                                                                                  

                                                                                  However, this takes about the same amount of time. I was hoping to see a big increase in speed :(

                                                                                  I've tried playing around with different DOWNLOAD_DELAY settings but that hasn't made things any faster.

                                                                                  All the concurrency settings are left at their defaults.

                                                                                  Any ideas on if/how I'm going wrong?

                                                                                  ANSWER

                                                                                  Answered 2022-Jan-22 at 16:39

                                                                                  Taking a stab at an answer here with no experience of the libraries.

It looks like Scrapy crawlers themselves are single-threaded. To get multi-threaded behavior you need to configure your application differently or write code that makes it behave that way. It sounds like you've already tried this, so it's probably not news to you, but make sure you have configured the CONCURRENT_REQUESTS and REACTOR_THREADPOOL_MAXSIZE settings.

                                                                                  https://docs.scrapy.org/en/latest/topics/settings.html?highlight=thread#reactor-threadpool-maxsize
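
For reference, both settings can go in the spider's custom_settings (a sketch with illustrative values, not tuned recommendations):

custom_settings = {
    "CONCURRENT_REQUESTS": 32,             # total requests Scrapy keeps in flight
    "CONCURRENT_REQUESTS_PER_DOMAIN": 16,  # per-domain cap, relevant since all URLs share one domain
    "REACTOR_THREADPOOL_MAXSIZE": 20,      # Twisted thread pool size (DNS lookups and other blocking work)
}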

I can't imagine there is much CPU work going on in the crawling process, so I doubt it's a GIL issue.

Excluding the GIL as an option, there are two possibilities here:

1. Your crawler is not actually multi-threaded. This may be because you are missing some setup or configuration that would make it so, i.e. you may have the configuration right, but your crawler is written in a way that processes requests for URLs synchronously instead of submitting them to a queue.

                                                                                  To test this, create a global object and store a counter on it. Each time your crawler starts a request increment the counter. Each time your crawler finishes a request, decrement the counter. Then run a thread that prints the counter every second. If your counter value is always 1, then you are still running synchronously.

                                                                                  # global_state.py
                                                                                  
                                                                                  GLOBAL_STATE = {"counter": 0}
                                                                                  
                                                                                  # middleware.py
                                                                                  
                                                                                  from global_state import GLOBAL_STATE
                                                                                  
                                                                                  class SeleniumMiddleware:
                                                                                  
                                                                                      def process_request(self, request, spider):
                                                                                          GLOBAL_STATE["counter"] += 1
                                                                                          self.driver.get(request.url)
                                                                                          GLOBAL_STATE["counter"] -= 1
                                                                                  
                                                                                          ...
                                                                                  
                                                                                  # main.py
                                                                                  
                                                                                  from global_state import GLOBAL_STATE
                                                                                  import threading
                                                                                  import time
                                                                                  
                                                                                  def main():
                                                                                    gst = threading.Thread(target=gs_watcher)
                                                                                    gst.start()
                                                                                  
                                                                                    # Start your app here
                                                                                  
                                                                                  def gs_watcher():
                                                                                    while True:
                                                                                      print(f"Concurrent requests: {GLOBAL_STATE['counter']}")
                                                                                      time.sleep(1)
                                                                                  
2. The site you are crawling is rate limiting you.

To test this, run the application multiple times. If you go from 50 req/s to 25 req/s per application, then you are being rate limited. To work around that, use a VPN to hop between IP addresses.
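
Another low-effort check (not from the original answer, just a suggestion): Scrapy's built-in LogStats extension periodically logs lines like "Crawled N pages (at M pages/min)", so lowering its interval and log level makes a drop in throughput easy to spot:

custom_settings = {
    "LOG_LEVEL": "INFO",        # LogStats logs at INFO, which the spiders above currently filter out
    "LOGSTATS_INTERVAL": 10.0,  # report the crawl rate every 10 seconds instead of the default 60
}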

                                                                                  If after that you find that you are running concurrent requests, and you are not being rate limited, then there is something funky going on in the libraries. Try removing chunks of code until you get to the bare minimum of what you need to crawl. If you have gotten to the absolute bare minimum implementation and it's still slow then you now have a minimal reproducible example and can get much better/informed help.

                                                                                  Source https://stackoverflow.com/questions/70647245

                                                                                  QUESTION

                                                                                  How can I send Dynamic website content to scrapy with the html content generated by selenium browser?
                                                                                  Asked 2022-Jan-20 at 15:35

I am working on some stock-related projects where I have to scrape all the data on a daily basis for the last 5 years, i.e. from 2016 to date. I thought of using Selenium because I can use a crawler and bot to scrape the data based on the date. I used button clicks with Selenium, and now I want the same data that is displayed by the Selenium browser to be fed to Scrapy. This is the website I am working on right now. I have written the following code inside the Scrapy spider.

                                                                                  class FloorSheetSpider(scrapy.Spider):
                                                                                      name = "nepse"
                                                                                  
                                                                                      def start_requests(self):
                                                                                  
                                                                                          driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
                                                                                          
                                                                                       
                                                                                          floorsheet_dates = ['01/03/2016','01/04/2016', up to till date '01/10/2022']
                                                                                  
                                                                                          for date in floorsheet_dates:
                                                                                              driver.get(
                                                                                                  "https://merolagani.com/Floorsheet.aspx")
                                                                                  
                                                                                              driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                                                                                                  ).send_keys(date)
                                                                                              driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
                                                                                              total_length = driver.find_element(By.XPATH,
                                                                                                                                 "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
                                                                                              z = int((total_length.split()[-1]).replace(']', ''))    
                                                                                              for data in range(z, z + 1):
                                                                                                  driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                                                                                                  self.url = driver.page_source
                                                                                                  yield Request(url=self.url, callback=self.parse)
                                                                                  
                                                                                                 
                                                                                      def parse(self, response, **kwargs):
                                                                                          for value in response.xpath('//tbody/tr'):
                                                                                              print(value.css('td::text').extract()[1])
                                                                                              print("ok"*200)
                                                                                  

                                                                                  Update: Error after answer is

                                                                                  2022-01-14 14:11:36 [twisted] CRITICAL: 
                                                                                  Traceback (most recent call last):
                                                                                    File "/home/navaraj/PycharmProjects/first_scrapy/env/lib/python3.8/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
                                                                                      result = current_context.run(gen.send, result)
                                                                                    File "/home/navaraj/PycharmProjects/first_scrapy/env/lib/python3.8/site-packages/scrapy/crawler.py", line 88, in crawl
                                                                                      start_requests = iter(self.spider.start_requests())
                                                                                  TypeError: 'NoneType' object is not iterable
                                                                                  

I want to send the current HTML content of the page to the Scrapy feed, but I have been getting this unusual error for the past 2 days. Any help or suggestions will be very much appreciated.

                                                                                  ANSWER

                                                                                  Answered 2022-Jan-14 at 09:30

The two solutions are not very different. Solution #2 fits your question better, but choose whichever you prefer.

Solution 1 - create a response with the HTML body from the driver and scrape it right away (you can also pass it as an argument to a function). Note that start_requests returns an empty list at the end; returning nothing (None) is what caused the "'NoneType' object is not iterable" error, since Scrapy iterates over whatever start_requests returns:

                                                                                  import scrapy
                                                                                  from selenium import webdriver
                                                                                  from selenium.webdriver.common.by import By
                                                                                  from scrapy.http import HtmlResponse
                                                                                  
                                                                                  
                                                                                  class FloorSheetSpider(scrapy.Spider):
                                                                                      name = "nepse"
                                                                                  
                                                                                      def start_requests(self):
                                                                                  
                                                                                          # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
                                                                                          driver = webdriver.Chrome()
                                                                                  
                                                                                          floorsheet_dates = ['01/03/2016','01/04/2016']#, up to till date '01/10/2022']
                                                                                  
                                                                                          for date in floorsheet_dates:
                                                                                              driver.get(
                                                                                                  "https://merolagani.com/Floorsheet.aspx")
                                                                                  
                                                                                              driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                                                                                                  ).send_keys(date)
                                                                                              driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
                                                                                              total_length = driver.find_element(By.XPATH,
                                                                                                                                 "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
                                                                                              z = int((total_length.split()[-1]).replace(']', ''))
                                                                                              for data in range(1, z + 1):
                                                                                                  driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                                                                                                  self.body = driver.page_source
                                                                                  
                                                                                                  response = HtmlResponse(url=driver.current_url, body=self.body, encoding='utf-8')
                                                                                                  for value in response.xpath('//tbody/tr'):
                                                                                                      print(value.css('td::text').extract()[1])
                                                                                                      print("ok"*200)
                                                                                  
                                                                                          # return an empty requests list
                                                                                          return []
                                                                                  

                                                                                  Solution 2 - with super simple downloader middleware:

(You might have a delay here in the parse method, so be patient.)

                                                                                  import scrapy
                                                                                  from scrapy import Request
                                                                                  from scrapy.http import HtmlResponse
                                                                                  from selenium import webdriver
                                                                                  from selenium.webdriver.common.by import By
                                                                                  
                                                                                  
                                                                                  class SeleniumMiddleware(object):
                                                                                      def process_request(self, request, spider):
                                                                                          url = spider.driver.current_url
                                                                                          body = spider.driver.page_source
                                                                                          return HtmlResponse(url=url, body=body, encoding='utf-8', request=request)
                                                                                  
                                                                                  
                                                                                  class FloorSheetSpider(scrapy.Spider):
                                                                                      name = "nepse"
                                                                                  
                                                                                      custom_settings = {
                                                                                          'DOWNLOADER_MIDDLEWARES': {
                                                                                              'tempbuffer.spiders.yetanotherspider.SeleniumMiddleware': 543,
                                                                                              # 'projects_name.path.to.your.pipeline': 543
                                                                                          }
                                                                                      }
                                                                                      driver = webdriver.Chrome()
                                                                                  
                                                                                      def start_requests(self):
                                                                                  
                                                                                          # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
                                                                                  
                                                                                  
                                                                                          floorsheet_dates = ['01/03/2016','01/04/2016']#, up to till date '01/10/2022']
                                                                                  
                                                                                          for date in floorsheet_dates:
                                                                                              self.driver.get(
                                                                                                  "https://merolagani.com/Floorsheet.aspx")
                                                                                  
                                                                                              self.driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                                                                                                  ).send_keys(date)
                                                                                              self.driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
                                                                                              total_length = self.driver.find_element(By.XPATH,
                                                                                                                                 "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
                                                                                              z = int((total_length.split()[-1]).replace(']', ''))
                                                                                              for data in range(1, z + 1):
                                                                                                  self.driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                                                                                                  self.body = self.driver.page_source
                                                                                                  self.url = self.driver.current_url
                                                                                  
                                                                                                  yield Request(url=self.url, callback=self.parse, dont_filter=True)
                                                                                  
                                                                                      def parse(self, response, **kwargs):
                                                                                          print('test ok')
                                                                                          for value in response.xpath('//tbody/tr'):
                                                                                              print(value.css('td::text').extract()[1])
                                                                                              print("ok"*200)
                                                                                  

Notice that I've used Chrome, so change it back to Firefox as in your original code.
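
If you do switch back, the headless Firefox setup from your question should drop in by replacing the driver = webdriver.Chrome() line; a sketch, assuming geckodriver is installed via webdriver-manager as in the original code:

from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager

options = webdriver.FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), options=options)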

                                                                                  Source https://stackoverflow.com/questions/70651053

                                                                                  QUESTION

                                                                                  Scrapy display response.request.url inside zip()
                                                                                  Asked 2021-Dec-22 at 07:59

                                                                                  I'm trying to create a simple Scrapy function which will loop through a set of standard URLs and pull their Alexa Rank. The output I want is just two columns: One showing the scraped Alexa Rank, and one showing the URL which was scraped.

                                                                                  Everything seems to be working except that I cannot get the scraped URL to display correctly in my output. My code currently is:

                                                                                  import scrapy
                                                                                  
                                                                                  class AlexarSpider(scrapy.Spider):
                                                                                      name = 'AlexaR'
                                                                                      #Will update allowed domains and start URL once I fix this problem
                                                                                      start_urls = ['http://www.alexa.com/siteinfo/google.com/', 
                                                                                      'https://www.alexa.com/siteinfo/reddit.com']
                                                                                  
                                                                                      def parse(self, response):
                                                                                          rank = response.css(".rankmini-rank::text").extract()
                                                                                          url_raw = response.request.url
                                                                                      
                                                                                          #extract content into rows
                                                                                          for item in zip(url_raw,rank):
                                                                                              scraped_info = {
                                                                                                  str('url_raw') : item[0],
                                                                                                  'rank' : item[1]
                                                                                              }
                                                                                  
                                                                                          yield scraped_info
                                                                                  

                                                                                  And then when run, the code outputs a table showing:

[Output screenshot: a two-column table with headers url_raw and rank; the rank column shows 21 and 1, while the url_raw column shows single characters such as "h" and "t".]

                                                                                  These are the correct scraped rankings (21 and 1) but the url_raw field is showing "h" or "t", rather than the actual URL string value. I've tried converting the url_raw variable to a string with no luck.

                                                                                  How can I set the variable up such that it displays the correct URL?

                                                                                  Thank you in advance for any help!

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-22 at 07:59

Here zip() takes 'rank', which is a list, and 'url_raw', which is a string, so each iteration pairs a rank with a single character from 'url_raw'.
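
A quick illustration in plain Python (the URL is just an example value):

>>> list(zip("https://example.com", [21, 1]))
[('h', 21), ('t', 1)]

zip() stops at the shorter input, so only the first characters of the URL string get paired with the ranks.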

                                                                                  Solution with cycle:

                                                                                  import scrapy
                                                                                  from itertools import cycle
                                                                                  
                                                                                  
                                                                                  class AlexarSpider(scrapy.Spider):
                                                                                      name = 'AlexaR'
                                                                                      #Will update allowed domains and start URL once I fix this problem
                                                                                      start_urls = ['http://www.alexa.com/siteinfo/google.com/',
                                                                                                    'https://www.alexa.com/siteinfo/reddit.com']
                                                                                  
                                                                                      def parse(self, response):
                                                                                          rank = response.css(".rankmini-rank::text").extract()
                                                                                          url_raw = response.request.url
                                                                                          #extract content into rows
                                                                                          for item in zip(cycle([url_raw]), rank):
                                                                                              scraped_info = {
                                                                                                  str('url_raw'): item[0],
                                                                                                  'rank': item[1]
                                                                                              }
                                                                                              yield scraped_info
                                                                                  

                                                                                  Solution with list:

                                                                                  import scrapy
                                                                                  
                                                                                  
                                                                                  class AlexarSpider(scrapy.Spider):
                                                                                      name = 'AlexaR'
                                                                                      #Will update allowed domains and start URL once I fix this problem
                                                                                      start_urls = ['http://www.alexa.com/siteinfo/google.com/',
                                                                                                    'https://www.alexa.com/siteinfo/reddit.com']
                                                                                  
                                                                                      def parse(self, response):
                                                                                          rank = response.css(".rankmini-rank::text").extract()
                                                                                          url_raw = [response.request.url for i in range(len(rank))]
                                                                                          #extract content into rows
                                                                                          for item in zip(url_raw, rank):
                                                                                              scraped_info = {
                                                                                                  str('url_raw'): item[0],
                                                                                                  'rank': item[1]
                                                                                              }
                                                                                              yield scraped_info
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/70440363

                                                                                  QUESTION

                                                                                  Yielding values from consecutive parallel parse functions via meta in Scrapy
                                                                                  Asked 2021-Dec-20 at 07:53

In my Scrapy code I'm trying to yield the following figures from the parliament's website where all the members of parliament (MPs) are listed. Opening the link for each MP, I'm making parallel requests to get the figures I'm trying to count. I intend to yield each of the three figures below together with the name and the party of the MP.

                                                                                  Here are the figures I'm trying to scrape

1. How many bill proposals each MP has their signature on
2. How many question proposals each MP has their signature on
3. How many times each MP spoke in parliament

In order to count and yield how many bills each member of parliament has their signature on, I'm trying to write a scraper for the members of parliament which works in three layers:

• Starting with the link where all MPs are listed
• From (1), accessing the individual page of each MP where the three figures defined above are displayed
• From each MP's page: (3a) requesting the page with bill proposals and counting them with the len function; (3b) requesting the page with question proposals and counting them with the len function; (3c) requesting the page with speeches and counting them with the len function

What I want: I want to yield the results of 3a, 3b and 3c together with the name and the party of the MP in the same row

• Problem 1) When I export the output to CSV, it only creates the fields for speech count, name and party. It doesn't show the fields for bill proposals and question proposals

• Problem 2) There are two empty values for each MP, which I guess correspond to the missing fields described in Problem 1

• Problem 3) What is a better way of restructuring my code so that the three values are output in the same row, rather than printing each MP three times, once for each value I'm scraping?

                                                                                  from scrapy import Spider
                                                                                  from scrapy.http import Request
                                                                                  
                                                                                  import logging
                                                                                  
                                                                                  
                                                                                  class MvSpider(Spider):
                                                                                      name = 'mv2'
                                                                                      allowed_domains = ['tbmm.gov.tr']
                                                                                      start_urls = ['https://www.tbmm.gov.tr/Milletvekilleri/liste']
                                                                                  
                                                                                      def parse(self, response):
        mv_list = response.xpath("//ul[@class='list-group list-group-flush']") #taking all MPs listed
                                                                                  
                                                                                          for mv in mv_list:
                                                                                              name = mv.xpath("./li/div/div/a/text()").get() # MP's name taken
                                                                                              party = mv.xpath("./li/div/div[@class='col-md-4 text-right']/text()").get().strip() #MP's party name taken
                                                                                              partial_link = mv.xpath('.//div[@class="col-md-8"]/a/@href').get()
                                                                                              full_link = response.urljoin(partial_link)
                                                                                  
                                                                                              yield Request(full_link, callback = self.mv_analysis, meta = {
                                                                                                                                                              'name': name,
                                                                                                                                                              'party': party
                                                                                                                                                          })
                                                                                  
                                                                                  
                                                                                      def mv_analysis(self, response):
                                                                                          name = response.meta.get('name')
                                                                                          party = response.meta.get('party')
                                                                                  
                                                                                          billprop_link_path = response.xpath(".//a[contains(text(),'İmzası Bulunan Kanun Teklifleri')]/@href").get()
                                                                                          billprop_link = response.urljoin(billprop_link_path)
                                                                                  
                                                                                          questionprop_link_path = response.xpath(".//a[contains(text(),'Sahibi Olduğu Yazılı Soru Önergeleri')]/@href").get()
                                                                                          questionprop_link = response.urljoin(questionprop_link_path)
                                                                                  
                                                                                          speech_link_path = response.xpath(".//a[contains(text(),'Genel Kurul Konuşmaları')]/@href").get()
                                                                                          speech_link = response.urljoin(speech_link_path)
                                                                                  
                                                                                          yield Request(billprop_link, callback = self.bill_prop_counter, meta = {
                                                                                                                                                              'name': name,
                                                                                                                                                              'party': party
                                                                                                                                                          })  #number of bill proposals to be requested
                                                                                  
                                                                                          yield Request(questionprop_link, callback = self.quest_prop_counter, meta = {
                                                                                                                                                              'name': name,
                                                                                                                                                              'party': party
                                                                                                                                                          }) #number of question propoesals to be requested
                                                                                  
                                                                                  
                                                                                          yield Request(speech_link, callback = self.speech_counter, meta = {
                                                                                                                                                              'name': name,
                                                                                                                                                              'party': party
                                                                                                                                                          })  #number of speeches to be requested
                                                                                  
                                                                                  
                                                                                  
                                                                                  
                                                                                  # COUNTING FUNCTIONS
                                                                                  
                                                                                  
                                                                                      def bill_prop_counter(self,response):
                                                                                  
                                                                                          name = response.meta.get('name')
                                                                                          party = response.meta.get('party')
                                                                                  
                                                                                          billproposals = response.xpath("//tr[@valign='TOP']")
                                                                                  
                                                                                          yield  { 'bill_prop_count': len(billproposals),
                                                                                                  'name': name,
                                                                                                  'party': party}
                                                                                  
                                                                                      def quest_prop_counter(self, response):
                                                                                  
                                                                                          name = response.meta.get('name')
                                                                                          party = response.meta.get('party')
                                                                                  
                                                                                          researchproposals = response.xpath("//tr[@valign='TOP']")
                                                                                  
                                                                                          yield {'res_prop_count': len(researchproposals),
                                                                                                 'name': name,
                                                                                                 'party': party}
                                                                                  
                                                                                      def speech_counter(self, response):
                                                                                  
                                                                                          name = response.meta.get('name')
                                                                                          party = response.meta.get('party')
                                                                                  
                                                                                          speeches = response.xpath("//tr[@valign='TOP']")
                                                                                  
                                                                                          yield { 'speech_count' : len(speeches),
                                                                                                 'name': name,
                                                                                                 'party': party}
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-18 at 06:26

This is happening because you are yielding dicts instead of item objects, so the spider engine has no guide to which fields it should export by default.

                                                                                  In order to make the csv output fields bill_prop_count and res_prop_count, you should make the following changes in your code:

                                                                                  1 - Create a base item object with all desirable fields - you can create this in the items.py file of your scrapy project:

                                                                                  from scrapy import Item, Field
                                                                                  
                                                                                  
                                                                                  class MvItem(Item):
                                                                                      name = Field()
                                                                                      party = Field()
                                                                                      bill_prop_count = Field()
                                                                                      res_prop_count = Field()
                                                                                      speech_count = Field()
                                                                                  

2 - Import the item class into the spider code and yield items populated from those dicts, instead of yielding plain dicts:

                                                                                  from your_project.items import MvItem
                                                                                  
                                                                                  ...
                                                                                  
                                                                                  # COUNTING FUNCTIONS
                                                                                  def bill_prop_counter(self,response):
                                                                                      name = response.meta.get('name')
                                                                                      party = response.meta.get('party')
                                                                                  
                                                                                      billproposals = response.xpath("//tr[@valign='TOP']")
                                                                                  
                                                                                      yield MvItem(**{ 'bill_prop_count': len(billproposals),
                                                                                              'name': name,
                                                                                              'party': party})
                                                                                  
                                                                                  def quest_prop_counter(self, response):
                                                                                      name = response.meta.get('name')
                                                                                      party = response.meta.get('party')
                                                                                  
                                                                                      researchproposals = response.xpath("//tr[@valign='TOP']")
                                                                                  
                                                                                      yield MvItem(**{'res_prop_count': len(researchproposals),
                                                                                             'name': name,
                                                                                             'party': party})
                                                                                  
                                                                                  def speech_counter(self, response):
                                                                                      name = response.meta.get('name')
                                                                                      party = response.meta.get('party')
                                                                                  
                                                                                      speeches = response.xpath("//tr[@valign='TOP']")
                                                                                  
                                                                                      yield MvItem(**{ 'speech_count' : len(speeches),
                                                                                             'name': name,
                                                                                             'party': party})
                                                                                  

                                                                                  The output csv will have all possible columns for the item:

                                                                                  bill_prop_count,name,party,res_prop_count,speech_count
                                                                                  ,Abdullah DOĞRU,AK Parti,,11
                                                                                  ,Mehmet Şükrü ERDİNÇ,AK Parti,,3
                                                                                  ,Muharrem VARLI,MHP,,13
                                                                                  ,Muharrem VARLI,MHP,0,
                                                                                  ,Jülide SARIEROĞLU,AK Parti,,3
                                                                                  ,İbrahim Halil FIRAT,AK Parti,,7
                                                                                  20,Burhanettin BULUT,CHP,,
                                                                                  ,Ünal DEMİRTAŞ,CHP,,22
                                                                                  ...
                                                                                  

Now, if you want to have all three counts in the same row, you'll have to change the design of your spider: for example, run one counting function at a time and pass the partially filled item along in the meta attribute, as sketched below.
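One possible sketch of that chained design (my addition, not from the original answer; it reuses the question's XPaths and the MvItem class from step 1, so treat the selectors as assumptions). These methods would replace mv_analysis and the counting functions in the question's spider; each callback fills one field and forwards the item, plus the remaining links, through meta:

def mv_analysis(self, response):
    item = MvItem(name=response.meta.get('name'), party=response.meta.get('party'))
    # collect the three detail links up front so they can travel with the item
    links = {
        'bill': response.urljoin(response.xpath(
            ".//a[contains(text(),'İmzası Bulunan Kanun Teklifleri')]/@href").get()),
        'question': response.urljoin(response.xpath(
            ".//a[contains(text(),'Sahibi Olduğu Yazılı Soru Önergeleri')]/@href").get()),
        'speech': response.urljoin(response.xpath(
            ".//a[contains(text(),'Genel Kurul Konuşmaları')]/@href").get()),
    }
    yield Request(links['bill'], callback=self.bill_prop_counter,
                  meta={'item': item, 'links': links})

def bill_prop_counter(self, response):
    item, links = response.meta['item'], response.meta['links']
    item['bill_prop_count'] = len(response.xpath("//tr[@valign='TOP']"))
    yield Request(links['question'], callback=self.quest_prop_counter,
                  meta={'item': item, 'links': links})

def quest_prop_counter(self, response):
    item, links = response.meta['item'], response.meta['links']
    item['res_prop_count'] = len(response.xpath("//tr[@valign='TOP']"))
    yield Request(links['speech'], callback=self.speech_counter, meta={'item': item})

def speech_counter(self, response):
    item = response.meta['item']
    item['speech_count'] = len(response.xpath("//tr[@valign='TOP']"))
    yield item   # one row per MP with all three counts filled in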

                                                                                  Source https://stackoverflow.com/questions/70399191

                                                                                  QUESTION

Parsing a 'Load More' response with HTML content
                                                                                  Asked 2021-Dec-12 at 09:10

I'm trying to scrape every announcement in Istanbul Governorate's announcement section at the link below, which loads content via a 'Load More' button at the bottom of the page. From the dev tools Network tab, I checked the properties of the POST request being sent and updated the headers accordingly. The response apparently is not JSON but HTML.

I would like to yield the parsed HTML responses, but when I crawl, it just doesn't return anything and gets stuck on the first request forever. Thank you in advance.

Could you explain what's wrong with my code? I checked dozens of questions here but couldn't resolve the issue. As far as I understand, it just can't parse the response HTML, but I couldn't figure out why.

                                                                                  ps: I have been enthusiastically into Python and scraping for 20 days. Forgive my ignorance.

                                                                                  import scrapy
                                                                                  
                                                                                  class DuyurularSpider(scrapy.Spider):
                                                                                      name = 'duyurular'
                                                                                      allowed_domains = ['istanbul.gov.tr']
                                                                                      start_urls = ['http://istanbul.gov.tr/duyurular']
                                                                                  
                                                                                      headerz = {
                                                                                          "Accept": "*/*",
                                                                                          "Accept-Encoding": "gzip, deflate",
                                                                                          "Accept-Language": "en-US,en;q=0.9",
                                                                                          "Connection" : "keep-alive",
                                                                                          "Content-Length": "112",
                                                                                          "Content-Type": "application/json",
                                                                                          "Cookie" : "_ga=GA1.3.285027250.1638576047; _gid=GA1.3.363882495.1639180128; ASP.NET_SessionId=ijw1mmc5xrpiw2iz32hmqb3a; NSC_ESNS=3e8876df-bcc4-11b4-9678-e2abf1d948a7_2815152435_0584317866_00000000013933875891; _gat_gtag_UA_136413027_31=1",
                                                                                          "Host": "istanbul.gov.tr",
                                                                                          "Origin": "http://istanbul.gov.tr",
                                                                                          "Referer": "http://istanbul.gov.tr/duyurular",
                                                                                          "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
                                                                                          "X-Requested-With": "XMLHttpRequest",
                                                                                                  }
                                                                                  
                                                                                      def parse(self, response):
                                                                                  
                                                                                          url = 'http://istanbul.gov.tr/ISAYWebPart/Announcement/AnnouncementDahaFazlaYukle'
                                                                                          load_more = scrapy.Request(url, callback = self.parse_api, method = "POST", headers = self.headerz)    
                                                                                  
                                                                                          yield load_more    
                                                                                  
                                                                                      def parse_api(self, response):
                                                                                          raw_data = response.body
                                                                                  
                                                                                          
                                                                                          data = raw_data.xpath('//div[@class="ministry-announcements"]')
                                                                                  
                                                                                          for bilgi in data:
                                                                                  
                                                                                              gun =  bilgi.xpath('//div[@class = "day"]/text()').extract_first()  #day
                                                                                              ay = bilgi.xpath('//div[@class = "month"]/text()').extract_first() #month
                                                                                  
                                                                                              metin = bilgi.xpath('//a[@class ="announce-text"]/text()').extract_first() #text
                                                                                  
                                                                                              yield {'Ay:' : ay,
                                                                                                     'Gün' : gun,
                                                                                                     'Metin': metin,}
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-12 at 09:10
1. Remove Content-Length; never include it manually in the headers. You should also remove the Cookie header and let Scrapy handle cookies.

2. Look at the request body in the dev tools and recreate it for every page, as done in the code below.

3. You need to know when to stop; in this case the signal is an empty page.

4. In the bilgi.xpath part you're getting the same line over and over because you forgot the dot at the beginning that makes the XPath relative to bilgi.
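For illustration (a minimal sketch, not part of the original answer, reusing the selectors from the question's code):

for bilgi in response.xpath('//div[@class="ministry-announcements"]'):
    # '//' is absolute: it searches the whole document, so every iteration returns the first match on the page
    bilgi.xpath('//div[@class = "day"]/text()').extract_first()
    # './/' is relative: it searches only inside the current 'bilgi' selector
    bilgi.xpath('.//div[@class = "day"]/text()').extract_first()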

                                                                                  The complete working code:

                                                                                  import scrapy
                                                                                  import json
                                                                                  
                                                                                  
                                                                                  class DuyurularSpider(scrapy.Spider):
                                                                                      name = 'duyurular'
                                                                                      allowed_domains = ['istanbul.gov.tr']
                                                                                      start_urls = ['http://istanbul.gov.tr/duyurular']
                                                                                      page = 1
                                                                                      
                                                                                      headers = {
                                                                                          "Accept": "*/*",
                                                                                          "Accept-Encoding": "gzip, deflate",
                                                                                          "Accept-Language": "en-US,en;q=0.9",
                                                                                          "Connection": "keep-alive",
                                                                                          "Content-Type": "application/json",
                                                                                          "Host": "istanbul.gov.tr",
                                                                                          "Origin": "http://istanbul.gov.tr",
                                                                                          "Referer": "http://istanbul.gov.tr/duyurular",
                                                                                          "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
                                                                                          "X-Requested-With": "XMLHttpRequest",
                                                                                      }
                                                                                  
                                                                                      def parse(self, response):
                                                                                          url = 'http://istanbul.gov.tr/ISAYWebPart/Announcement/AnnouncementDahaFazlaYukle'
                                                                                          body = {
                                                                                              "ContentCount": "8",
                                                                                              "ContentTypeId": "D6mHJdtwBYsvtS2xCvXiww==",
                                                                                              "GosterimSekli": "1",
                                                                                              "OrderByAsc": "true",
                                                                                              "page": f"{str(self.page)}"
                                                                                          }
                                                                                  
                                                                                          if response.body.strip():    # check if we get an empty page
                                                                                              load_more = scrapy.Request(url, method="POST", headers=self.headers, body=json.dumps(body))
                                                                                              yield load_more
                                                                                              self.page += 1
                                                                                  
                                                                                              data = response.xpath('//div[@class="ministry-announcements"]')
                                                                                              for bilgi in data:
                                                                                                  gun = bilgi.xpath('.//div[@class = "day"]/text()').extract_first()  #day
                                                                                                  ay = bilgi.xpath('.//div[@class = "month"]/text()').extract_first() #month
                                                                                  
                                                                                                  metin = bilgi.xpath('.//a[@class ="announce-text"]/text()').extract_first() #text
                                                                                  
                                                                                                  yield {
                                                                                                      'Ay:': ay,
                                                                                                      'Gün': gun,
                                                                                                      'Metin': metin.strip(),
                                                                                                  }
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/70319472

                                                                                  QUESTION

Looping through multiple URLs from a CSV file in Scrapy is not working
                                                                                  Asked 2021-Dec-01 at 18:53

When I try to execute this loop I get an error; please help. I want to scrape multiple links read from a CSV file, but it gets stuck in the start requests. I am using Scrapy 2.5 and Python 3.9.7.

                                                                                  from scrapy import Request
                                                                                  from scrapy.http import request
                                                                                  import pandas as pd
                                                                                  
                                                                                  
                                                                                  class PagedataSpider(scrapy.Spider):
                                                                                      name = 'pagedata'
                                                                                      allowed_domains = ['www.imdb.com']
                                                                                  
                                                                                      def start_requests(self):
                                                                                          df = pd.read_csv('list1.csv')
                                                                                          #Here fileContainingUrls.csv is a csv file which has a column named as 'URLS'
                                                                                          # contains all the urls which you want to loop over. 
                                                                                          urlList = df['link'].values.to_list()
                                                                                          for i in urlList:
                                                                                              yield scrapy.Request(url = i, callback=self.parse)
                                                                                  

                                                                                  error:

                                                                                  2021-11-09 22:06:45 [scrapy.core.engine] INFO: Spider opened
                                                                                  2021-11-09 22:06:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
                                                                                  2021-11-09 22:06:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
                                                                                  2021-11-09 22:06:45 [scrapy.core.engine] ERROR: Error while obtaining start requests
                                                                                  Traceback (most recent call last):
                                                                                    File "C:\Users\Vivek\Desktop\Scrapy\myenv\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
                                                                                      request = next(slot.start_requests)
                                                                                    File "C:\Users\Vivek\Desktop\Scrapy\moviepages\moviepages\spiders\pagedata.py", line 18, in start_requests
                                                                                      urlList = df['link'].values.to_list()
                                                                                  AttributeError: 'numpy.ndarray' object has no attribute 'to_list'
                                                                                  2021-11-09 22:06:45 [scrapy.core.engine] INFO: Closing spider (finished)
                                                                                  2021-11-09 22:06:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
                                                                                  {'elapsed_time_seconds': 0.007159,
                                                                                   'finish_reason': 'finished',
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2021-Nov-09 at 17:07

                                                                                  The error you received is rather straightforward; a numpy array doesn't have a to_list method.

                                                                                  Instead you should simply iterate over the numpy array:

import scrapy
                                                                                  import pandas as pd
                                                                                  
                                                                                  
                                                                                  class PagedataSpider(scrapy.Spider):
                                                                                      name = 'pagedata'
                                                                                      allowed_domains = ['www.imdb.com']
                                                                                  
                                                                                      def start_requests(self):
                                                                                          df = pd.read_csv('list1.csv')
                                                                                  
                                                                                          urls = df['link']
                                                                                          for url in urls:
                                                                                              yield scrapy.Request(url=url, callback=self.parse)
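
As a side note (my addition, not part of the original answer), both pandas and numpy do provide list-conversion methods, just under slightly different names than the one used in the question:

import pandas as pd

# illustrative data; the real spider reads list1.csv
df = pd.DataFrame({'link': ['https://www.imdb.com/title/tt0111161/']})
print(df['link'].to_list())         # pandas Series method
print(df['link'].values.tolist())   # the numpy ndarray method is tolist(), not to_list()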
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/69902187

                                                                                  QUESTION

During recursive scraping in Scrapy, how to extract info from multiple nodes of a parent URL and the associated child URLs together?
                                                                                  Asked 2021-Nov-22 at 13:09

The parent URL has multiple nodes (quotes), and each parent node has a child URL (author info). I am having trouble linking each quote to its author info; is this due to the asynchronous nature of Scrapy?

How can I fix this issue? Here's the code so far. I added # <--- comments for easy spotting.

                                                                                  import scrapy 
                                                                                  
                                                                                  class AuthorSpider(scrapy.Spider):
                                                                                      name = 'quotes1'
                                                                                      var = None # <----
                                                                                  
                                                                                      def start_requests(self):
                                                                                          start_urls = ['http://quotes.toscrape.com/']
                                                                                          yield scrapy.Request(url=start_urls[0], callback=self.parse)
                                                                                  
                                                                                      def parse(self, response):
                                                                                  
                                                                                          for quote in response.css('div.quote'):
                                                                                              AuthorSpider.var = quote.css('div span.text::text').get() # <----
                                                                                  
                                                                                              authShortLink = quote.css('small.author + a::attr(href)').get()
                                                                                              authFullLink = response.urljoin(authShortLink)
                                                                                              yield scrapy.Request(url=authFullLink, callback=self.parse_author)
                                                                                  
                                                                                          # # looping through next pages
                                                                                          # nextPage = response.css('li.next a::attr(href)').get()
                                                                                          # if nextPage is not None:
                                                                                          #     nextPage = response.urljoin(nextPage)
                                                                                          #     yield scrapy.Request(url=nextPage, callback=self.parse)
                                                                                  
                                                                                      def parse_author(self, response):
                                                                                          def extract_with_css(query):
                                                                                              return response.css(query).get(default='').strip()
                                                                                  
                                                                                          yield {
                                                                                              'name': extract_with_css('h3.author-title::text'),
                                                                                              'birthdate': extract_with_css('.author-born-date::text'),
                                                                                              'quote' : AuthorSpider.var
                                                                                          }
                                                                                  
                                                                                  

Please note that in order to allow duplicate requests, I added DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter' to settings.py

                                                                                  Output I am getting presently-

                                                                                  [
                                                                                  {"name": "Albert Einstein", "birthdate": "March 14, 1879", "quote": "\u201cA day without sunshine is like, you know, night.\u201d"},
                                                                                  {"name": "Marilyn Monroe", "birthdate": "June 01, 1926", "quote": "\u201cA day without sunshine is like, you know, night.\u201d"},
                                                                                  {"name": "Jane Austen", "birthdate": "December 16, 1775", "quote": "\u201cA day without sunshine is like, you know, night.\u201d"},
                                                                                  {"name": "Albert Einstein", "birthdate": "March 14, 1879", "quote": "\u201cA day without sunshine is like, you know, night.\u201d"},
                                                                                  {"name": "J.K. Rowling", "birthdate": "July 31, 1965", "quote": "\u201cA day without sunshine is like, you know, night.\u201d"},
                                                                                  {"name": "Albert Einstein", "birthdate": "March 14, 1879", "quote": "\u201cA day without sunshine is like, you know, night.\u201d"},
                                                                                  {"name": "Steve Martin", "birthdate": "August 14, 1945", "quote": "\u201cA day without sunshine is like, you know, night.\u201d"},
                                                                                  {"name": "Eleanor Roosevelt", "birthdate": "October 11, 1884", "quote": "\u201cA day without sunshine is like, you know, night.\u201d"},
                                                                                  {"name": "Thomas A. Edison", "birthdate": "February 11, 1847", "quote": "\u201cA day without sunshine is like, you know, night.\u201d"},
                                                                                  {"name": "Andr\u00e9 Gide", "birthdate": "November 22, 1869", "quote": "\u201cA day without sunshine is like, you know, night.\u201d"}
                                                                                  ]
                                                                                  

                                                                                  Thanks in advance!

                                                                                  ANSWER

                                                                                  Answered 2021-Nov-22 at 13:09

Here is a minimal working solution. Both types of pagination work, and I use the meta keyword to transfer the quote from one response to another.

                                                                                  import scrapy
                                                                                  class AuthorSpider(scrapy.Spider):
                                                                                      name = 'quotes1'
    start_urls = [f'https://quotes.toscrape.com/page/{x}/' for x in range(1, 11)]
                                                                                  
                                                                                      def parse(self, response):
                                                                                  
                                                                                          for quote in response.css('div.quote'):
                                                                                              Author = quote.css('span.text::text').get()  # <----
                                                                                  
                                                                                              authShortLink = quote.css('small.author + a::attr(href)').get()
                                                                                              authFullLink = response.urljoin(authShortLink)
                                                                                              yield scrapy.Request(url=authFullLink, callback=self.parse_author, meta={'Author': Author})
                                                                                  
                                                                                          # # looping through next pages
                                                                                          # nextPage = response.css('li.next a::attr(href)').get()
                                                                                          # abs_url = f'http://quotes.toscrape.com/{nextPage}'
                                                                                              #yield scrapy.Request(url=abs_url, callback=self.parse)
                                                                                  
                                                                                      def parse_author(self, response):
                                                                                          quote=response.meta.get('Author')
                                                                                          yield {
                                                                                              'Name': response.css('h3.author-title::text').get().strip(),
                                                                                              'Date of birth': response.css('span.author-born-date::text').get(),
                                                                                              'Quote':quote,
                                                                                              'url':response.url}
                                                                                    
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/70062567
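A side note on the answer above (not part of the original answer): Scrapy 1.7+ also supports cb_kwargs, which delivers extra data to the callback as ordinary keyword arguments instead of going through response.meta. Below is a minimal sketch of the same hand-off using cb_kwargs; the spider name and class name are hypothetical.

import scrapy


class AuthorCbKwargsSpider(scrapy.Spider):
    # hypothetical name for this sketch only
    name = 'quotes_cb_kwargs'
    start_urls = ['https://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            quote_text = quote.css('span.text::text').get()
            author_link = quote.css('small.author + a::attr(href)').get()
            # cb_kwargs entries arrive as keyword arguments in the callback
            yield response.follow(author_link, callback=self.parse_author,
                                  cb_kwargs={'quote_text': quote_text})

    def parse_author(self, response, quote_text):
        yield {
            'Name': response.css('h3.author-title::text').get().strip(),
            'Date of birth': response.css('span.author-born-date::text').get(),
            'Quote': quote_text,
            'url': response.url,
        }

cb_kwargs keeps the callback signature explicit, while meta remains the right place for data that downloader or spider middlewares also need to see.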

                                                                                  QUESTION

                                                                                  How to set class variable through __init__ in Python?
                                                                                  Asked 2021-Nov-08 at 20:06

I am trying to change a setting from the command line while starting a scrapy crawler (Python 3.7). Therefore I am adding an __init__ method, but I could not figure out how to change the class variable "delay" from within __init__.

Minimal example:

from scrapy.spiders import CrawlSpider


class testSpider(CrawlSpider):
                                                                                  
                                                                                      custom_settings = {
                                                                                          'DOWNLOAD_DELAY': 10,  # default value
                                                                                      }
                                                                                  
                                                                                      """ get arguments passed over CLI
                                                                                          scrapyd usage: -d arg1=val1
                                                                                          scrapy  usage: -a arg1=val1
                                                                                      """
                                                                                      def __init__(self, *args, **kwargs):
                                                                                          super(testSpider, self).__init__(*args, **kwargs)
                                                                                  
                                                                                          self.delay = kwargs.get('delay')
                                                                                  
                                                                                          if self.delay:
                                                                                              testSpider.custom_settings['DOWNLOAD_DELAY'] = self.delay
                                                                                              print('init:', testSpider.custom_settings['DOWNLOAD_DELAY'])
                                                                                  
                                                                                      print(custom_settings['DOWNLOAD_DELAY'])
                                                                                  

Unfortunately, this does not change the setting:

                                                                                  scrapy crawl test -a delay=5
                                                                                  10
                                                                                  init: 5
                                                                                  

                                                                                  How can the class variable be changed?

                                                                                  ANSWER

                                                                                  Answered 2021-Nov-08 at 20:06

I am trying to change a setting from the command line while starting a scrapy crawler (Python 3.7). Therefore I am adding an __init__ method...
                                                                                  ...
                                                                                  scrapy crawl test -a delay=5

1. According to the Scrapy docs (Settings / Command line options section), you are required to use the -s parameter to update a setting:
                                                                                    scrapy crawl test -s DOWNLOAD_DELAY=5

2. It is not possible to update settings at runtime from spider code (from __init__ or other methods); see the related discussion on GitHub: Update spider settings during runtime #4196.
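To make point 1 concrete, here is a minimal sketch (not from the original answer) of applying a different DOWNLOAD_DELAY, assuming a Scrapy project with a spider registered under the name test and, for the second option, that the spider is launched from a script.

# Option 1 (from the answer): override the setting on the command line.
#   scrapy crawl test -s DOWNLOAD_DELAY=5
#
# Option 2 (sketch, assumption: the crawl is started from a script): set the
# value on the Settings object before the crawl starts; settings cannot be
# changed once crawling is underway.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set('DOWNLOAD_DELAY', 5)   # must happen before process.crawl()

process = CrawlerProcess(settings)
process.crawl('test')               # spider name assumed to exist in the project
process.start()                     # blocks until the crawl finishes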

                                                                                  Source https://stackoverflow.com/questions/69882916

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

                                                                                  Vulnerabilities

                                                                                  No vulnerabilities reported

                                                                                  Install scrapy

You can install scrapy with 'pip install scrapy' or download it from GitHub or PyPI.
Once installed, you can use scrapy like any standard Python library. You will need a development environment with a Python distribution (including header files), a compiler, pip, and git installed. Make sure that pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changing system-wide packages, as sketched below.
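A minimal sketch of that setup, assuming a Unix-like shell with Python 3 on the PATH:

# create and activate an isolated environment, then install scrapy from PyPI
python -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools wheel
pip install scrapy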

                                                                                  Support

For new features, suggestions, and bug reports, create an issue on GitHub. If you have any questions, check and ask them on the Stack Overflow community page.
                                                                                  Install
                                                                                • PyPI

                                                                                  pip install Scrapy

                                                                                • CLONE
                                                                                • HTTPS

                                                                                  https://github.com/scrapy/scrapy.git

                                                                                • CLI

                                                                                  gh repo clone scrapy/scrapy

                                                                                • sshUrl

                                                                                  git@github.com:scrapy/scrapy.git


Consider Popular Crawler Libraries

• scrapy by scrapy
• cheerio by cheeriojs
• winston by winstonjs
• pyspider by binux
• colly by gocolly

Try Top Libraries by scrapy

• scrapyd by scrapy (Python)
• scrapely by scrapy (HTML)
• dirbot by scrapy (Python)
• quotesbot by scrapy (Python)
• parsel by scrapy (Python)

Compare Crawler Libraries with Highest Support

• scrapy by scrapy
• pyspider by binux
• crawler4j by yasserg
• webmagic by code4craft
• cheerio by cheeriojs
