
scrapy | fast high-level web crawling | Crawler library

by scrapy | Python Version: 1.8.2 | License: Non-SPDX


kandi X-RAY | scrapy Summary

scrapy is a Python library typically used in Automation, Crawler, and Selenium applications. scrapy has a build file available and high support. However, it has 47 bugs, 6 vulnerabilities, and a Non-SPDX license. You can install it with 'pip install scrapy' or download it from GitHub or PyPI.
Scrapy, a fast high-level web crawling & scraping framework for Python.

Support

  • scrapy has a highly active ecosystem.
  • It has 42899 star(s) with 9532 fork(s). There are 1815 watchers for this library.
  • There were 3 major release(s) in the last 12 months.
  • There are 497 open issues and 2088 closed issues. On average, issues are closed in 182 days. There are 268 open pull requests and 0 closed pull requests.
  • It has a positive sentiment in the developer community.
  • The latest version of scrapy is 1.8.2.

Quality

  • scrapy has 47 bugs (1 blocker, 3 critical, 31 major, 12 minor) and 588 code smells.

Security

  • No vulnerabilities have been reported for scrapy or for its dependent libraries.
  • However, kandi's code analysis flags 6 unresolved vulnerabilities (1 blocker, 0 critical, 5 major, 0 minor).
  • There are 1523 security hotspots that need review.

License

  • scrapy has a Non-SPDX License.
  • A Non-SPDX license can be an open-source license that simply is not SPDX-compliant, or a license that is not open source at all, so review it closely before use.

Reuse

  • scrapy releases are available to install and integrate.
  • A deployable package is available on PyPI.
  • Build file is available. You can build the component from source.
  • scrapy saves you 17074 person hours of effort in developing the same functionality from scratch.
  • It has 33894 lines of code, 3646 functions and 333 files.
  • It has low code complexity. Code complexity directly impacts maintainability of the code.
Top functions reviewed by kandi - BETA

kandi has reviewed scrapy and identified the functions below as its top functions. This is intended to give you an instant insight into the functionality scrapy implements and to help you decide whether it suits your requirements; a short usage sketch follows the list.

  • Called when the response is ready
    • Return a list of values
    • Create headers from a Twisted response
    • Update the values in seq
  • Create a deprecated class
    • Return the path to the class
    • Check whether a given class is a subclass of the deprecated class
  • Recursively follow requests
    • Parse a selector
  • Execute scrapy
    • Handle data received from the crawler
  • Create a subclass of ScrapyRequestQueue
  • Called when an item processor is dropped
  • Download the robots.txt file
  • Log download errors
  • Callback function for verifying an SSL connection
  • Start the crawler
  • Follow given URLs
  • Return media to download
  • Return whether a cached response is fresh
  • Run text tests
  • Follow a URL
  • Return a list of request headers
  • Called when the request is downloaded
  • Call the download function
  • Configure logging
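
To make a few of these concrete (notably "Start the crawler", "Configure logging" and "Download the robots.txt file"), here is a minimal sketch of running a spider programmatically with CrawlerProcess; the spider, its name and the target site are illustrative assumptions, not code taken from scrapy or from kandi's review.

import scrapy
from scrapy.crawler import CrawlerProcess


class TitleSpider(scrapy.Spider):
    # Hypothetical one-field spider, used only to illustrate the run machinery.
    name = "title_sketch"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}


if __name__ == "__main__":
    # CrawlerProcess configures logging from the settings it receives and then
    # starts the crawl; ROBOTSTXT_OBEY makes Scrapy fetch and honour robots.txt.
    process = CrawlerProcess(settings={
        "LOG_LEVEL": "INFO",
        "ROBOTSTXT_OBEY": True,
    })
    process.crawl(TitleSpider)
    process.start()  # blocks until the crawl finishes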


                                    scrapy Key Features

                                    Scrapy, a fast high-level web crawling & scraping framework for Python.
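
                                    As a quick illustration of that feature set, here is a minimal, self-contained spider sketch; the spider name, the output fields and the target site (quotes.toscrape.com, a public practice site) are illustrative assumptions rather than anything shipped with scrapy.

                                    import scrapy


                                    class QuotesSpider(scrapy.Spider):
                                        name = "quotes_example"  # hypothetical spider name
                                        start_urls = ["https://quotes.toscrape.com/"]

                                        def parse(self, response):
                                            # Select each quote block with CSS and yield one dict per quote.
                                            for quote in response.css("div.quote"):
                                                yield {
                                                    "text": quote.css("span.text::text").get(),
                                                    "author": quote.css("small.author::text").get(),
                                                }

                                    # Run it without a project, for example:
                                    #   scrapy runspider quotes_example.py -o quotes.json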

                                    scrapy Examples and Code Snippets


                                    How to correctly loop links with Scrapy?

                                    import urllib.parse

                                    # Option 1: build the absolute URL by hand before creating the Request.
                                    link = urllib.parse.urljoin("https://www.tandfonline.com", link)

                                    yield scrapy.Request(link, callback=self.parse_abstract_page)

                                    # Option 2: let the response build the absolute URL for you.
                                    yield scrapy.Request(response.urljoin(links), callback=self.parse_abstract_page)

                                    # Option 3: response.follow accepts relative URLs directly.
                                    yield response.follow(abstract_url, callback=self.parse_abstract)

                                    # Option 4: follow several URLs at once.
                                    yield from response.follow_all(list_of_urls, callback=self.parse_abstract)
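
                                    Put together, a minimal sketch of wiring these into a spider (not part of the original answer; the selectors come from the question's HTML shown further down, and cb_kwargs requires Scrapy 1.7 or later) might look like:

                                    import scrapy


                                    class JpsSpider(scrapy.Spider):
                                        name = "jps_sketch"  # hypothetical name
                                        start_urls = ["https://www.tandfonline.com/toc/fjps20/current"]

                                        def parse(self, response):
                                            for article in response.css("div.articleEntry"):
                                                item = {
                                                    "Title": article.css("span.hlFld-Title::text").get(),
                                                    "License": article.css("span.part-tooltip::text").get(),
                                                }
                                                abstract_url = article.css("div.tocDeliverFormatsLinks a::attr(href)").get()
                                                if abstract_url:
                                                    # response.follow resolves the relative href against the page URL.
                                                    yield response.follow(abstract_url, callback=self.parse_abstract,
                                                                          cb_kwargs={"item": item})

                                        def parse_abstract(self, response, item):
                                            item["Abstract"] = " ".join(response.css("div.hlFld-Abstract ::text").getall())
                                            yield item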
                                    
                                    

                                    Scrapy exclude URLs containing specific text

                                    import scrapy
                                    from scrapy.spiders import CrawlSpider, Rule
                                    from scrapy.linkextractors import LinkExtractor
                                    
                                    class LinkscrawlItem(scrapy.Item):
                                        link = scrapy.Field()
                                        attr = scrapy.Field()
                                    
                                    class SomeSpider(CrawlSpider):
                                        name = 'mysitecrawler'
                                        allowed_domains = ['mysite.co.uk']
                                        start_urls = ['https://mysite.co.uk/']
                                    
                                        rules = (
                                            Rule(LinkExtractor(deny=('my-account', 'cart', 'checkout', 'wp-content')), callback="parse_obj", follow=True),
                                        )
                                    
                                        def parse_obj(self,response):
                                            item = LinkscrawlItem()
                                            item["link"] = str(response.url)+":"+str(response.status)
                                            filename = 'links2.txt'
                                            with open(filename, 'a') as f:
                                                f.write('\n'+str(response.url)+":"+str(response.status)+'\n')
                                            self.log('Saved file %s' % filename)
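
                                    As a design note, a common alternative (a sketch, not part of the original answer) is to yield the item and let Scrapy's feed exports write the output, instead of opening the file inside the callback:

                                        def parse_obj(self, response):
                                            item = LinkscrawlItem()
                                            # Record the URL together with its HTTP status, as the original callback does.
                                            item["link"] = f"{response.url}:{response.status}"
                                            yield item

                                    # Then write the output with a feed export when running the spider, for example:
                                    #   scrapy crawl mysitecrawler -o links2.csv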
                                    

                                    Why does scrapy_splash CrawlSpider take the same amount of time as scrapy with Selenium?

                                    # global_state.py
                                    
                                    GLOBAL_STATE = {"counter": 0}
                                    
                                    # middleware.py
                                    
                                    from global_state import GLOBAL_STATE
                                    
                                    class SeleniumMiddleware:
                                    
                                        def process_request(self, request, spider):
                                            GLOBAL_STATE["counter"] += 1
                                            self.driver.get(request.url)
                                            GLOBAL_STATE["counter"] -= 1
                                    
                                            ...
                                    
                                    # main.py
                                    
                                    from global_state import GLOBAL_STATE
                                    import threading
                                    import time
                                    
                                    def main():
                                      gst = threading.Thread(target=gs_watcher)
                                      gst.start()
                                    
                                      # Start your app here
                                    
                                    def gs_watcher():
                                      while True:
                                        print(f"Concurrent requests: {GLOBAL_STATE['counter']}")
                                        time.sleep(1)
                                    

                                    How can I send dynamic website content to Scrapy with the HTML content generated by a Selenium browser?

                                    import scrapy
                                    from selenium import webdriver
                                    from selenium.webdriver.common.by import By
                                    from scrapy.http import HtmlResponse
                                    
                                    
                                    class FloorSheetSpider(scrapy.Spider):
                                        name = "nepse"
                                    
                                        def start_requests(self):
                                    
                                            # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
                                            driver = webdriver.Chrome()
                                    
                                            floorsheet_dates = ['01/03/2016', '01/04/2016']  # ..., continues up to '01/10/2022'
                                    
                                            for date in floorsheet_dates:
                                                driver.get(
                                                    "https://merolagani.com/Floorsheet.aspx")
                                    
                                                driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                                                    ).send_keys(date)
                                                driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
                                                total_length = driver.find_element(By.XPATH,
                                                                                   "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
                                                z = int((total_length.split()[-1]).replace(']', ''))
                                                for data in range(1, z + 1):
                                                    driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                                                    self.body = driver.page_source
                                    
                                                    response = HtmlResponse(url=driver.current_url, body=self.body, encoding='utf-8')
                                                    for value in response.xpath('//tbody/tr'):
                                                        print(value.css('td::text').extract()[1])
                                                        print("ok"*200)
                                    
                                            # return an empty requests list
                                            return []
                                    
                                    import scrapy
                                    from scrapy import Request
                                    from scrapy.http import HtmlResponse
                                    from selenium import webdriver
                                    from selenium.webdriver.common.by import By
                                    
                                    
                                    class SeleniumMiddleware(object):
                                        def process_request(self, request, spider):
                                            url = spider.driver.current_url
                                            body = spider.driver.page_source
                                            return HtmlResponse(url=url, body=body, encoding='utf-8', request=request)
                                    
                                    
                                    class FloorSheetSpider(scrapy.Spider):
                                        name = "nepse"
                                    
                                        custom_settings = {
                                            'DOWNLOADER_MIDDLEWARES': {
                                                'tempbuffer.spiders.yetanotherspider.SeleniumMiddleware': 543,
                                                # 'projects_name.path.to.your.pipeline': 543
                                            }
                                        }
                                        driver = webdriver.Chrome()
                                    
                                        def start_requests(self):
                                    
                                            # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
                                    
                                    
                                            floorsheet_dates = ['01/03/2016', '01/04/2016']  # ..., continues up to '01/10/2022'
                                    
                                            for date in floorsheet_dates:
                                                self.driver.get(
                                                    "https://merolagani.com/Floorsheet.aspx")
                                    
                                                self.driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                                                    ).send_keys(date)
                                                self.driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
                                                total_length = self.driver.find_element(By.XPATH,
                                                                                   "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
                                                z = int((total_length.split()[-1]).replace(']', ''))
                                                for data in range(1, z + 1):
                                                    self.driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                                                    self.body = self.driver.page_source
                                                    self.url = self.driver.current_url
                                    
                                                    yield Request(url=self.url, callback=self.parse, dont_filter=True)
                                    
                                        def parse(self, response, **kwargs):
                                            print('test ok')
                                            for value in response.xpath('//tbody/tr'):
                                                print(value.css('td::text').extract()[1])
                                                print("ok"*200)
                                    

                                    Scrapy display response.request.url inside zip()

                                    import scrapy
                                    from itertools import cycle
                                    
                                    
                                    class AlexarSpider(scrapy.Spider):
                                        name = 'AlexaR'
                                        #Will update allowed domains and start URL once I fix this problem
                                        start_urls = ['http://www.alexa.com/siteinfo/google.com/',
                                                      'https://www.alexa.com/siteinfo/reddit.com']
                                    
                                        def parse(self, response):
                                            rank = response.css(".rankmini-rank::text").extract()
                                            url_raw = response.request.url
                                            #extract content into rows
                                            for item in zip(cycle([url_raw]), rank):
                                                scraped_info = {
                                                    'url_raw': item[0],
                                                    'rank': item[1]
                                                }
                                                yield scraped_info
                                    
                                    import scrapy
                                    
                                    
                                    class AlexarSpider(scrapy.Spider):
                                        name = 'AlexaR'
                                        #Will update allowed domains and start URL once I fix this problem
                                        start_urls = ['http://www.alexa.com/siteinfo/google.com/',
                                                      'https://www.alexa.com/siteinfo/reddit.com']
                                    
                                        def parse(self, response):
                                            rank = response.css(".rankmini-rank::text").extract()
                                            url_raw = [response.request.url for i in range(len(rank))]
                                            #extract content into rows
                                            for item in zip(url_raw, rank):
                                                scraped_info = {
                                                    'url_raw': item[0],
                                                    'rank': item[1]
                                                }
                                                yield scraped_info
                                    
                                    

                                    Yielding values from consecutive parallel parse functions via meta in Scrapy

                                    from scrapy import Item, Field
                                    
                                    
                                    class MvItem(Item):
                                        name = Field()
                                        party = Field()
                                        bill_prop_count = Field()
                                        res_prop_count = Field()
                                        speech_count = Field()
                                    
                                    from your_project.items import MvItem
                                    
                                    ...
                                    
                                    # COUNTING FUNCTIONS
                                    def bill_prop_counter(self,response):
                                        name = response.meta.get('name')
                                        party = response.meta.get('party')
                                    
                                        billproposals = response.xpath("//tr[@valign='TOP']")
                                    
                                        yield MvItem(**{ 'bill_prop_count': len(billproposals),
                                                'name': name,
                                                'party': party})
                                    
                                    def quest_prop_counter(self, response):
                                        name = response.meta.get('name')
                                        party = response.meta.get('party')
                                    
                                        researchproposals = response.xpath("//tr[@valign='TOP']")
                                    
                                        yield MvItem(**{'res_prop_count': len(researchproposals),
                                               'name': name,
                                               'party': party})
                                    
                                    def speech_counter(self, response):
                                        name = response.meta.get('name')
                                        party = response.meta.get('party')
                                    
                                        speeches = response.xpath("//tr[@valign='TOP']")
                                    
                                        yield MvItem(**{ 'speech_count' : len(speeches),
                                               'name': name,
                                               'party': party})
                                    
                                    bill_prop_count,name,party,res_prop_count,speech_count
                                    ,Abdullah DOĞRU,AK Parti,,11
                                    ,Mehmet Şükrü ERDİNÇ,AK Parti,,3
                                    ,Muharrem VARLI,MHP,,13
                                    ,Muharrem VARLI,MHP,0,
                                    ,Jülide SARIEROĞLU,AK Parti,,3
                                    ,İbrahim Halil FIRAT,AK Parti,,7
                                    20,Burhanettin BULUT,CHP,,
                                    ,Ünal DEMİRTAŞ,CHP,,22
                                    ...
                                    
                                    

                                    Parsing a 'Load More' response with HTML content

                                    import scrapy
                                    import json
                                    
                                    
                                    class DuyurularSpider(scrapy.Spider):
                                        name = 'duyurular'
                                        allowed_domains = ['istanbul.gov.tr']
                                        start_urls = ['http://istanbul.gov.tr/duyurular']
                                        page = 1
                                        
                                        headers = {
                                            "Accept": "*/*",
                                            "Accept-Encoding": "gzip, deflate",
                                            "Accept-Language": "en-US,en;q=0.9",
                                            "Connection": "keep-alive",
                                            "Content-Type": "application/json",
                                            "Host": "istanbul.gov.tr",
                                            "Origin": "http://istanbul.gov.tr",
                                            "Referer": "http://istanbul.gov.tr/duyurular",
                                            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
                                            "X-Requested-With": "XMLHttpRequest",
                                        }
                                    
                                        def parse(self, response):
                                            url = 'http://istanbul.gov.tr/ISAYWebPart/Announcement/AnnouncementDahaFazlaYukle'
                                            body = {
                                                "ContentCount": "8",
                                                "ContentTypeId": "D6mHJdtwBYsvtS2xCvXiww==",
                                                "GosterimSekli": "1",
                                                "OrderByAsc": "true",
                                                "page": str(self.page)
                                            }
                                    
                                            if response.body.strip():    # check if we get an empty page
                                                load_more = scrapy.Request(url, method="POST", headers=self.headers, body=json.dumps(body))
                                                yield load_more
                                                self.page += 1
                                    
                                                data = response.xpath('//div[@class="ministry-announcements"]')
                                                for bilgi in data:
                                                    gun = bilgi.xpath('.//div[@class = "day"]/text()').extract_first()  #day
                                                    ay = bilgi.xpath('.//div[@class = "month"]/text()').extract_first() #month
                                    
                                                    metin = bilgi.xpath('.//a[@class ="announce-text"]/text()').extract_first() #text
                                    
                                                    yield {
                                                        'Ay:': ay,
                                                        'Gün': gun,
                                                        'Metin': metin.strip(),
                                                    }
                                    

                                    Looping through multiple URLs from a CSV file in Scrapy is not working

                                    import scrapy
                                    import pandas as pd
                                    
                                    
                                    class PagedataSpider(scrapy.Spider):
                                        name = 'pagedata'
                                        allowed_domains = ['www.imdb.com']
                                    
                                        def start_requests(self):
                                            df = pd.read_csv('list1.csv')
                                    
                                            urls = df['link']
                                            for url in urls:
                                                yield scrapy.Request(url=url, callback=self.parse)
                                    

                                    During recursive scraping in Scrapy, how to extract info from multiple nodes of a parent URL and the associated child URLs together?

                                    import scrapy


                                    class AuthorSpider(scrapy.Spider):
                                        name = 'quotes1'
                                        start_urls = [f'https://quotes.toscrape.com/page/{x}/' for x in range(1, 11)]
                                    
                                        def parse(self, response):
                                    
                                            for quote in response.css('div.quote'):
                                                Author = quote.css('span.text::text').get()  # the quote text, passed on via meta
                                    
                                                authShortLink = quote.css('small.author + a::attr(href)').get()
                                                authFullLink = response.urljoin(authShortLink)
                                                yield scrapy.Request(url=authFullLink, callback=self.parse_author, meta={'Author': Author})
                                    
                                            # # looping through next pages
                                            # nextPage = response.css('li.next a::attr(href)').get()
                                            # abs_url = f'http://quotes.toscrape.com/{nextPage}'
                                                #yield scrapy.Request(url=abs_url, callback=self.parse)
                                    
                                        def parse_author(self, response):
                                            quote=response.meta.get('Author')
                                            yield {
                                                'Name': response.css('h3.author-title::text').get().strip(),
                                                'Date of birth': response.css('span.author-born-date::text').get(),
                                                'Quote':quote,
                                                'url':response.url}
                                      
                                    

                                    CSS selector of link to the next page returns empty list in Scrapy shell

                                    # with css:
                                    In [1]: response.css('head > link:nth-child(28) ::attr(href)').get()                                                   
                                    Out[1]: 'https://book24.ru/knigi-bestsellery/page-2/'
                                    
                                    # with xpath:
                                    In [2]: response.xpath('//link[@rel="next"]/@href').get()
                                    Out[2]: 'https://book24.ru/knigi-bestsellery/page-2/'
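
                                    Once the selector works in the shell, a minimal sketch of using it for pagination inside a spider (assuming a parse callback like the one below) is:

                                    def parse(self, response):
                                        # ... extract the items on the current listing page here ...

                                        # Follow the <link rel="next"> element located above to reach the next page.
                                        next_page = response.xpath('//link[@rel="next"]/@href').get()
                                        if next_page:
                                            yield response.follow(next_page, callback=self.parse)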
                                    


                                    Community Discussions

                                    Trending Discussions on scrapy
                                    • How to correctly loop links with Scrapy?
                                    • Scrapy exclude URLs containing specific text
                                    • Why does scrapy_splash CrawlSpider take the same amount of time as scrapy with Selenium?
                                    • How can I send dynamic website content to Scrapy with the HTML content generated by a Selenium browser?
                                    • Scrapy display response.request.url inside zip()
                                    • Yielding values from consecutive parallel parse functions via meta in Scrapy
                                    • Parsing a 'Load More' response with HTML content
                                    • Looping through multiple URLs from a CSV file in Scrapy is not working
                                    • During recursive scraping in Scrapy, how to extract info from multiple nodes of a parent URL and the associated child URLs together?
                                    • How to set class variable through __init__ in Python?

                                    QUESTION

                                    How to correctly loop links with Scrapy?

                                    Asked 2022-Mar-03 at 09:22

                                    I'm using Scrapy and I'm having some problems while looping through links.

                                    I'm scraping most of the information from a single page, except for one field that points to another page.

                                    There are 10 articles on each page. For each article I have to get the abstract, which is on a second page. The correspondence between articles and abstracts is 1:1.

                                    Here is the div section I'm using to scrape the data:

                                    <div class="articleEntry">
                                        <div class="tocArticleEntry include-metrics-panel toc-article-tools">
                                            <div class="item-checkbox-container" role="checkbox" aria-checked="false" aria-labelledby="article-d401999e88">
                                                <label tabindex="0" class="checkbox--primary"><input type="checkbox"
                                                        name="10.1080/03066150.2021.1956473"><span class="box-btn"></span></label></div><span
                                                class="article-type">Article</span>
                                            <div class="art_title linkable"><a class="ref nowrap" href="/doi/full/10.1080/03066150.2021.1956473"><span
                                                        class="hlFld-Title" id="article-d401999e88">Climate change and agrarian struggles: an invitation to
                                                        contribute to a <i>JPS</i> Forum</span></a></div>
                                            <div class="tocentryright">
                                                <div class="tocAuthors afterTitle">
                                                    <div class="articleEntryAuthor all"><span class="articleEntryAuthorsLinks"><span><a
                                                                    href="/author/Borras+Jr.%2C+Saturnino+M">Saturnino M. Borras Jr.</a></span>, <span><a
                                                                    href="/author/Scoones%2C+Ian">Ian Scoones</a></span>, <span><a
                                                                    href="/author/Baviskar%2C+Amita">Amita Baviskar</a></span>, <span><a
                                                                    href="/author/Edelman%2C+Marc">Marc Edelman</a></span>, <span><a
                                                                    href="/author/Peluso%2C+Nancy+Lee">Nancy Lee Peluso</a></span> &amp; <span><a
                                                                    href="/author/Wolford%2C+Wendy">Wendy Wolford</a></span></span></div>
                                                </div>
                                                <div class="tocPageRange maintextleft">Pages: 1-28</div>
                                                <div class="tocEPubDate"><span class="maintextleft"><strong>Published online:</strong><span class="date"> 06
                                                            Aug 2021</span></span></div>
                                            </div>
                                            <div class="sfxLinkButton"></div>
                                            <div class="tocDeliverFormatsLinks"><a href="/doi/abs/10.1080/03066150.2021.1956473">Abstract</a> | <a
                                                    class="ref nowrap full" href="/doi/full/10.1080/03066150.2021.1956473">Full Text</a> | <a
                                                    class="ref nowrap references" href="/doi/ref/10.1080/03066150.2021.1956473">References</a> | <a
                                                    class="ref nowrap nocolwiz" target="_blank" title="Opens new window"
                                                    href="/doi/pdf/10.1080/03066150.2021.1956473">PDF (2239 KB)</a> | <a class="ref nowrap epub"
                                                    href="/doi/epub/10.1080/03066150.2021.1956473" target="_blank">EPUB</a> | <a
                                                    href="/servlet/linkout?type=rightslink&amp;url=startPage%3D1%26pageCount%3D28%26author%3DSaturnino%2BM.%2BBorras%2BJr.%252C%2B%252C%2BIan%2BScoones%252C%2Bet%2Bal%26orderBeanReset%3Dtrue%26imprint%3DRoutledge%26volumeNum%3D49%26issueNum%3D1%26contentID%3D10.1080%252F03066150.2021.1956473%26title%3DClimate%2Bchange%2Band%2Bagrarian%2Bstruggles%253A%2Ban%2Binvitation%2Bto%2Bcontribute%2Bto%2Ba%2BJPS%2BForum%26numPages%3D28%26pa%3D%26oa%3DCC-BY-NC-ND%26issn%3D0306-6150%26publisherName%3Dtandfuk%26publication%3DFJPS%26rpt%3Dn%26endPage%3D28%26publicationDate%3D01%252F02%252F2022"
                                                    class="rightslink" target="_blank" title="Opens new window">Permissions</a>\xa0</div>
                                            <div class="metrics-panel">
                                                <ul class="altmetric-score true">
                                                    <li><span>6049</span> Views</li>
                                                    <li><span>0</span> CrossRef citations</li>
                                                    <li class="value" data-doi="10.1080/03066150.2021.1956473"><span class="metrics-score">0</span>Altmetric
                                                    </li>
                                                </ul>
                                            </div><span class="access-icon oa" role="img" aria-label="Access provided by Open Access"></span><span
                                                class="part-tooltip">Open Access</span>
                                        </div>
                                    </div>
                                    

                                    To do so, I have defined the following script:

                                    from cgitb import text
                                    import scrapy
                                    import pandas as pd
                                    
                                    
                                    class QuotesSpider(scrapy.Spider):
                                        name = "jps"
                                    
                                        start_urls = ['https://www.tandfonline.com/toc/fjps20/current']
                                        
                                    
                                        def parse(self, response):
                                            self.logger.info('hello this is my first spider')
                                            Title = response.xpath("//span[@class='hlFld-Title']").extract()
                                            Authors = response.xpath("//span[@class='articleEntryAuthorsLinks']").extract()
                                            License = response.xpath("//span[@class='part-tooltip']").extract()
                                            abstract_url = response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract()
                                            row_data = zip(Title, Authors, License, abstract_url)
                                            
                                            for quote in row_data:
                                                scraped_info = {
                                                    # key:value
                                                    'Title': quote[0],
                                                    'Authors': quote[1],
                                                    'License': quote[2],
                                                    'Abstract': quote[3]
                                                }
                                                # yield/give the scraped info to scrapy
                                                yield scraped_info
                                        
                                        
                                        def parse_links(self, response):
                                            
                                            for links in response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract():
                                                yield scrapy.Request(links, callback=self.parse_abstract_page)
                                            #yield response.follow(abstract_url, callback=self.parse_abstract_page)
                                        
                                        def parse_abstract_page(self, response):
                                            Abstract = response.xpath("//div[@class='hlFld-Abstract']").extract_first()
                                            row_data = zip(Abstract)
                                            for quote in row_data:
                                                scraped_info_abstract = {
                                                    # key:value
                                                    'Abstract': quote[0]
                                                }
                                                # yield/give the scraped info to scrapy
                                                yield scraped_info_abstract
                                            
                                    

                                    Authors, title, and license are scraped correctly. For the abstract, I get the following error:

                                    ValueError: Missing scheme in request url: /doi/abs/10.1080/03066150.2021.1956473
                                    

                                    To check whether the path was correct, I took abstract_url out of the loop:

                                     abstract_url = response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract_first()
                                     self.logger.info('get abstract page url')
                                     yield response.follow(abstract_url, callback=self.parse_abstract)
                                    

                                    I can correctly reach the abstract corresponding to the first article, but not the others. I think the error is in the loop.

                                    How can I solve this issue?

                                    Thanks

                                    ANSWER

                                    Answered 2022-Mar-01 at 19:43

                                    Judging from the exception, the link to the article abstract is a relative link: /doi/abs/10.1080/03066150.2021.1956473 doesn't start with https:// or http://.

                                    You should join this relative URL with the website's base URL. For example, if the base URL is "https://www.tandfonline.com", you can do:

                                    import urllib.parse
                                    
                                    link = urllib.parse.urljoin("https://www.tandfonline.com", link)
                                    

                                    Then you'll have a proper URL to the resource.
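
                                    Scrapy can also do the joining for you: response.urljoin() resolves a relative href against the URL of the page being parsed, and response.follow() accepts relative URLs directly. Below is a minimal sketch under those assumptions; the spider name and the reduced set of fields are illustrative, not the asker's exact code.

                                    import scrapy


                                    class JpsAbstractSpider(scrapy.Spider):
                                        # illustrative name; the question's spider is called "jps"
                                        name = "jps_abstracts"
                                        start_urls = ['https://www.tandfonline.com/toc/fjps20/current']

                                        def parse(self, response):
                                            # response.follow() resolves relative hrefs such as
                                            # /doi/abs/10.1080/03066150.2021.1956473 against the page URL
                                            for link in response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract():
                                                yield response.follow(link, callback=self.parse_abstract_page)
                                                # equivalent with an explicit absolute URL:
                                                # yield scrapy.Request(response.urljoin(link), callback=self.parse_abstract_page)

                                        def parse_abstract_page(self, response):
                                            # only the abstract is yielded here; merge it with the listing
                                            # fields in a pipeline or via cb_kwargs if you need them together
                                            yield {'Abstract': response.xpath("//div[@class='hlFld-Abstract']").extract_first()}

                                    Note also that in the posted spider parse_links is never registered as a callback, so the abstract requests are never scheduled; whichever approach you use, the link-following loop has to be reachable from parse.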

                                    Source https://stackoverflow.com/questions/71308962

                                    Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

                                    Vulnerabilities

                                    No vulnerabilities reported

                                    Install scrapy

                                    You can install it using 'pip install scrapy' or download it from GitHub or PyPI.
                                    You can use scrapy like any standard Python library. You will need a development environment consisting of a Python distribution (including header files), a compiler, pip, and git. Make sure that pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changing system-wide packages.
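
                                    A minimal sketch of that workflow on a Unix-like shell (the environment name .venv is just an example):

                                    python -m venv .venv                      # create an isolated environment
                                    source .venv/bin/activate                 # on Windows: .venv\Scripts\activate
                                    python -m pip install --upgrade pip setuptools wheel
                                    pip install scrapy
                                    scrapy version                            # confirm the installation works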

                                    Support

                                    For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask them on the Stack Overflow community page.

                                    • © 2022 Open Weaver Inc.