Popular New Releases in Crawler
scrapy 1.8.2
cheerio
winston v3.7.1
crawlab v0.6.0-beta.20211224
InfoSpider INFO-SPIDER v1.0
Popular Libraries in Crawler
by scrapy python
42899 NOASSERTION
Scrapy, a fast high-level web crawling & scraping framework for Python.
by cheeriojs typescript
24992 MIT
Fast, flexible, and lean implementation of core jQuery designed specifically for the server.
by winstonjs javascript
18711 MIT
A logger for just about everything.
by binux python
15124 Apache-2.0
A Powerful Spider(Web Crawler) System in Python.
by gocolly go
14735 Apache-2.0
Elegant Scraper and Crawler Framework for Golang
by Jack-Cherish python
13287
:rainbow: Python3 web-crawling projects in practice: Taobao, JD.com, NetEase Cloud Music, Bilibili, 12306, Douyin, Biquge, comic and novel downloads, music and movie downloads, and more.
by jhao104 python
13159 MIT
A proxy IP pool for Python crawlers (proxy pool).
by code4craft java
10324 Apache-2.0
A scalable web crawler framework for Java.
by lingochamp java
10289 Apache-2.0
Multitask, multi-thread (multi-connection), breakpoint resume, high concurrency, simple to use, single or multi-process.
Trending New libraries in Crawler
by kangvcar python
5976 GPL-3.0
INFO-SPIDER is a crawler toolbox 🧰 that brings many data sources together in one place, designed to help users retrieve their own data safely and quickly; the tool's code is open source and its workflow is transparent. Supported data sources include GitHub, QQ Mail, NetEase Mail, Alibaba Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, WeChat Moments album generation, browser history, 12306, Cnblogs, CSDN blog, OSChina blog, and Jianshu.
by JAVClub javascript
2644 MIT
🔞 JAVClub - make sure your favorite videos never go missing again.
by 201206030 java
1931 Apache-2.0
小说精品屋-plus (novel-plus) is a full-featured original-fiction CMS for multi-platform (PC, WAP) reading. It is made up of several subsystems, including a front-end portal, an author back office, a platform back office, and a crawler management system, and supports multiple templates, member top-ups, subscription mode, news publishing, and real-time statistical reports; new books are ingested automatically and existing books are updated automatically.
by BlankerL python
1827 MIT
Realtime crawler and API for the 2019 novel coronavirus outbreak | COVID-19/2019-nCoV Realtime Infection Crawler and API
by MoyuScript python
1241 GPL-3.0
An API wrapper module for Bilibili.
by soxoj python
1170 MIT
🕵️♂️ Collect a dossier on a person by username from thousands of sites
by Passkou python
959 GPL-3.0
An API wrapper module for Bilibili.
by jaeles-project go
856 MIT
Gospider - Fast web spider written in Go
by Boris-code python
771 NOASSERTION
feapder is a Python crawler framework that supports distributed crawling, batch collection, task-loss protection, and rich alerting.
Top Authors in Crawler
1: 21 Libraries, 1554
2: 11 Libraries, 1206
3: 11 Libraries, 3217
4: 8 Libraries, 600
5: 7 Libraries, 93
6: 6 Libraries, 41
7: 6 Libraries, 105
8: 6 Libraries, 448
9: 6 Libraries, 158
10: 6 Libraries, 34
Trending Kits in Crawler
No Trending Kits are available at this moment for Crawler
Trending Discussions on Crawler
How to test form submission with wrong values using Symfony crawler component and PHPUnit?
Setting proxies when crawling websites with Python
Can't Successfully Run AWS Glue Job That Reads From DynamoDB
Why does scrapy_splash CrawlSpider take the same amount of time as scrapy with Selenium?
How can I send Dynamic website content to scrapy with the html content generated by selenium browser?
How to set class variable through __init__ in Python?
headless chrome on docker M1 error - unable to discover open window in chrome
How do I pass in arguments non-interactive into a bash file that uses "read"?
Scrapy crawls duplicate data
AWS Glue Crawler sends all data to Glue Catalog and Athena without Glue Job
QUESTION
How to test form submission with wrong values using Symfony crawler component and PHPUnit?
Asked 2022-Apr-05 at 11:18
When you're using the app through the browser and you send a bad value, the system checks the form for errors, and if something goes wrong (it does in this case), it redirects with a default error message written below the incriminated field.
This is the behaviour I am trying to assert in my test case, but I came across an \InvalidArgumentException I was not expecting.
I am using symfony/phpunit-bridge with phpunit/phpunit v8.5.23 and symfony/dom-crawler v5.3.7. Here's a sample of what it looks like:
public function testPayloadNotRespectingFieldLimits(): void
{
    $client = static::createClient();

    /** @var SomeRepository $repo */
    $repo = self::getContainer()->get(SomeRepository::class);
    $countEntries = $repo->count([]);

    $crawler = $client->request(
        'GET',
        '/route/to/form/add'
    );
    $this->assertResponseIsSuccessful(); // Goes ok.

    $form = $crawler->filter('[type=submit]')->form(); // It does retrieve my form node.

    // This is where it's not working.
    $form->setValues([
        'some[name]' => 'Someokvalue',
        'some[color]' => 'SomeNOTOKValue', // A ChoiceType with limited values; 'SomeNOTOKValue' does not belong. This line throws an \InvalidArgumentException.
    ]);

    // What I'd like to assert after this
    $client->submit($form);
    $this->assertResponseRedirects();
    $this->assertEquals($countEntries, $repo->count([]));
}
Here's the exception message I get:
InvalidArgumentException: Input "some[color]" cannot take "SomeNOTOKValue" as a value (possible values: "red", "pink", "purple", "white").
vendor/symfony/dom-crawler/Field/ChoiceFormField.php:140
vendor/symfony/dom-crawler/FormFieldRegistry.php:113
vendor/symfony/dom-crawler/Form.php:75
The ColorChoiceType tested here is pretty standard:
public function configureOptions(OptionsResolver $resolver): void
{
    $resolver->setDefaults([
        'choices' => ColorEnumType::getChoices(),
        'multiple' => false,
    ]);
}
What I can do is wrap the line that sets the wrong value in a try-catch block. The form is then submitted and the test proceeds to the next assertion. The issue is that the form is considered submitted and valid: an appropriate value is forced onto the color field (the first choice of the enum set). This is not what happens when I try it in my browser (cf. the intro).
// ...
/** @var SomeRepository $repo */
$repo = self::getContainer()->get(SomeRepository::class);
$countEntries = $repo->count([]); // Gives 0.
// ...
try {
    $form->setValues([
        'some[name]' => 'Someokvalue',
        'some[color]' => 'SomeNOTOKValue',
    ]);
} catch (\InvalidArgumentException $e) {}

$client->submit($form); // Now it submits the form.
$this->assertResponseRedirects(); // Ok.
$this->assertEquals($countEntries, $repo->count([])); // Failed asserting that 1 matches expected 0. !!
How can I mimic the browser behaviour in my test case and make assertions on it?
ANSWER
Answered 2022-Apr-05 at 11:17
It seems that you can disable validation on the DomCrawler\Form component, based on the official documentation.
Doing this now works as expected:
$form = $crawler->filter('[type=submit]')->form()->disableValidation();
$form->setValues([
    'some[name]' => 'Someokvalue',
    'some[color]' => 'SomeNOTOKValue',
]);
$client->submit($form);

$this->assertEquals($entriesBefore, $repo->count([])); // Now passes.
QUESTION
Setting proxies when crawling websites with Python
Asked 2022-Mar-12 at 18:30
I want to set proxies for my crawler. I'm using the requests module and Beautiful Soup. I have found a list of API links that provide free proxies with 4 types of protocols.
The proxies for 3 of the 4 protocols (HTTP, SOCKS4, SOCKS5) all work; the ones that don't are the HTTPS proxies. This is my code:
from bs4 import BeautifulSoup
import requests
import random
import json

# LIST OF FREE PROXY APIS, THESE PROXIES ARE LAST TIME TESTED 50 MINUTES AGO, PROTOCOLS: HTTP, HTTPS, SOCKS4 AND SOCKS5
list_of_proxy_content = ["https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=CH&protocols=http%2Chttps%2Csocks4%2Csocks5",
                         "https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=FR&protocols=http%2Chttps%2Csocks4%2Csocks5",
                         "https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=DE&protocols=http%2Chttps%2Csocks4%2Csocks5",
                         "https://proxylist.geonode.com/api/proxy-list?limit=1500&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=AT&protocols=http%2Chttps%2Csocks4%2Csocks5",
                         "https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=IT&protocols=http%2Chttps%2Csocks4%2Csocks5"]


# EXTRACTING JSON DATA FROM THIS LIST OF PROXIES
full_proxy_list = []
for proxy_url in list_of_proxy_content:

    proxy_json = requests.get(proxy_url).text
    proxy_json = json.loads(proxy_json)
    proxy_json = proxy_json["data"]

    full_proxy_list.extend(proxy_json)

# CREATING PROXY DICT
final_proxy_list = []
for proxy in full_proxy_list:

    #print(proxy) # JSON VALUE FOR ALL DATA THAT GOES INTO PROXY

    protocol = proxy['protocols'][0]
    ip_ = proxy['ip']
    port = proxy['port']

    proxy = {protocol : protocol + '://' + ip_ + ':' + port}

    final_proxy_list.append(proxy)


# TRYING PROXY ON 3 DIFFERENT WEBSITES
for proxy in final_proxy_list:

    print(proxy)
    try:
        r0 = requests.get("https://edition.cnn.com/", proxies=proxy, timeout = 15)
        if r0.status_code == 200:
            print("GOOD PROXY")
        else:
            print("BAD PROXY")
    except:
        print("proxy error")

    try:
        r1 = requests.get("https://www.buelach.ch/", proxies=proxy, timeout = 15)
        if r1.status_code == 200:
            print("GOOD PROXY")
        else:
            print("BAD PROXY")
    except:
        print("proxy error")

    try:
        r2 = requests.get("https://www.blog.police.be.ch/", proxies=proxy, timeout = 15)
        if r2.status_code == 200:
            print("GOOD PROXY")
        else:
            print("BAD PROXY")
    except:
        print("proxy error")

    print()
My question is: why do the HTTPS proxies not work, and what am I doing wrong?
My proxies look like this:
{'socks4': 'socks4://185.168.173.35:5678'}
{'http': 'http://62.171.177.80:3128'}
{'https': 'http://159.89.28.169:3128'}
I have seen that sometimes people pass proxies like this:
proxies = {"http": "http://10.10.1.10:3128",
           "https": "http://10.10.1.10:1080"}
But this dict has two protocol keys while the proxy URLs themselves are only http. Why? Can I pass only one, and can I pass 10 different IP addresses in this dict?
ANSWER
Answered 2021-Sep-17 at 16:08
I did some research on the topic and now I'm confused why you want a proxy for HTTPS.
While it is understandable to want a proxy for HTTP (HTTP is unencrypted), HTTPS is secure.
Could it be possible your proxy is not connecting because you don't need one?
I am not a proxy expert, so I apologize if I'm putting out something completely stupid.
I don't want to leave you completely empty-handed though. If you are looking for complete privacy, I would suggest a VPN. Both Windscribe and RiseUpVPN are free and encrypt all your data on your computer. (The desktop version, not the browser extension.)
While this is not a fully automated process, it is still very effective.
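As a side note on the proxies mapping itself: in requests, the keys of the proxies dict name the scheme of the target URL, so the "https" entry is the proxy used for https:// requests, and only one proxy per scheme is applied to a given request. Rotating through many IP addresses therefore means choosing a different dict per request. Below is a minimal sketch of that idea; the proxy addresses are placeholders, not real servers.

import random
import requests

# Placeholder proxy servers, purely for illustration.
proxy_pool = [
    {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080"},
    {"http": "http://10.10.1.11:3128", "https": "http://10.10.1.11:1080"},
]

for url in ["https://edition.cnn.com/", "http://example.com/"]:
    # Pick one proxy dict per request; requests matches the key to the URL scheme.
    proxy = random.choice(proxy_pool)
    try:
        r = requests.get(url, proxies=proxy, timeout=15)
        print(url, r.status_code)
    except requests.RequestException as exc:
        print(url, "proxy error:", exc)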
QUESTION
Can't Successfully Run AWS Glue Job That Reads From DynamoDB
Asked 2022-Feb-07 at 10:49
I have successfully run crawlers that read my table in DynamoDB and also in AWS Redshift. The tables are now in the catalog. My problem is when running the Glue job that reads the data from DynamoDB to Redshift. It doesn't seem to be able to read from DynamoDB. The error logs contain this:
2022-02-01 10:16:55,821 WARN [task-result-getter-0] scheduler.TaskSetManager (Logging.scala:logWarning(69)): Lost task 0.0 in stage 0.0 (TID 0) (172.31.74.37 executor 1): java.lang.RuntimeException: Could not lookup table <TABLE-NAME> in DynamoDB.
    at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:143)
    at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.getThroughput(ReadIopsCalculator.java:67)
    at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.calculateTargetIops(ReadIopsCalculator.java:58)
    at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.initReadManager(AbstractDynamoDBRecordReader.java:152)
    at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.<init>(AbstractDynamoDBRecordReader.java:84)
    at org.apache.hadoop.dynamodb.read.DefaultDynamoDBRecordReader.<init>(DefaultDynamoDBRecordReader.java:24)
    at org.apache.hadoop.dynamodb.read.DynamoDBInputFormat.getRecordReader(DynamoDBInputFormat.java:32)
    at com.amazonaws.services.glue.connections.DynamoConnection.getReader(DynamoConnection.scala:136)
    at com.amazonaws.services.glue.DynamicRecordRDD.compute(DataSource.scala:610)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to dynamodb.us-east-1.amazonaws.com:443 [dynamodb.us-east-1.amazonaws.com/3.218.180.106] failed: connect timed out
    at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:120)
    at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
    at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:132)
    ... 23 more
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to dynamodb.us-east-1.amazonaws.com:443 [dynamodb.us-east-1.amazonaws.com/3.218.180.106] failed: connect timed out
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1207)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1153)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:6164)
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:6131)
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeDescribeTable(AmazonDynamoDBClient.java:2228)
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.describeTable(AmazonDynamoDBClient.java:2193)
    at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:136)
    at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:133)
    at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:80)
    ... 24 more
Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to dynamodb.us-east-1.amazonaws.com:443 [dynamodb.us-east-1.amazonaws.com/3.218.180.106] failed: connect timed out
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
    at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
    at com.amazonaws.http.conn.$Proxy20.connect(Unknown Source)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1331)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
    ... 38 more
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:607)
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:368)
    at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:142)
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
    ... 53 more
and the complete logs additionally contain this:
22/02/01 10:06:07 INFO GlueContext: Glue secret manager integration: secretId is not provided.
The role that Glue has been given has administrator access.
Below is the code for the script:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="db_s3_table",
    transformation_ctx="S3bucket_node1",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[
        ("column1.s", "string", "column1", "string"),
        ("column2.n", "string", "column2", "long"),
        ("column3.s", "string", "column3", "string"),
        ("partition_0", "string", "partition0", "string"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node Redshift Cluster
RedshiftCluster_node3 = glueContext.write_dynamic_frame.from_catalog(
    frame=ApplyMapping_node2,
    database="db",
    table_name="db_redshift_db_schema_table",
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="RedshiftCluster_node3",
)

job.commit()
ANSWER
Answered 2022-Feb-07 at 10:49
It seems that you were missing a VPC Endpoint for DynamoDB, since your Glue jobs run in a private VPC when you write to Redshift.
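As a rough sketch of that fix (the VPC and route-table IDs below are placeholders, and the same endpoint can just as well be created from the VPC console), a gateway endpoint for DynamoDB can be added with boto3 like this:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# A gateway endpoint adds a route to DynamoDB to the chosen route tables,
# so traffic from the private subnets never needs to reach the public internet.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # placeholder: the VPC used by the Glue connection
    ServiceName="com.amazonaws.us-east-1.dynamodb",
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder route table
)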
QUESTION
Why does scrapy_splash CrawlSpider take the same amount of time as scrapy with Selenium?
Asked 2022-Jan-22 at 16:39
I have the following scrapy CrawlSpider:
import logger as lg
from scrapy.crawler import CrawlerProcess
from scrapy.http import Response
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashTextResponse
from urllib.parse import urlencode
from scrapy.linkextractors import LinkExtractor
from scrapy.http import HtmlResponse

logger = lg.get_logger("oddsportal_spider")


class SeleniumScraper(CrawlSpider):

    name = "splash"

    custom_settings = {
        "USER_AGENT": "*",
        "LOG_LEVEL": "WARNING",
        "DOWNLOADER_MIDDLEWARES": {
            'scraper_scrapy.odds.middlewares.SeleniumMiddleware': 543,
        },
    }

    httperror_allowed_codes = [301]

    start_urls = ["https://www.oddsportal.com/tennis/results/"]

    rules = (
        Rule(
            LinkExtractor(allow="/atp-buenos-aires/results/"),
            callback="parse_tournament",
            follow=True,
        ),
        Rule(
            LinkExtractor(
                allow="/tennis/",
                restrict_xpaths=("//td[@class='name table-participant']//a"),
            ),
            callback="parse_match",
        ),
    )

    def parse_tournament(self, response: Response):
        logger.info(f"Parsing tournament - {response.url}")

    def parse_match(self, response: Response):
        logger.info(f"Parsing match - {response.url}")


process = CrawlerProcess()
process.crawl(SeleniumScraper)
process.start()
The Selenium middleware is as follows:
class SeleniumMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        logger.debug(f"Selenium processing request - {request.url}")
        self.driver.get(request.url)
        return HtmlResponse(
            request.url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )

    def spider_opened(self, spider):
        options = webdriver.FirefoxOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Firefox(
            options=options,
            executable_path=Path("/opt/geckodriver/geckodriver"),
        )

    def spider_closed(self, spider):
        self.driver.close()
End to end this takes around a minute for around 50 pages. To try to speed things up and take advantage of multiple threads and JavaScript, I've implemented the following scrapy_splash spider:
class SplashScraper(CrawlSpider):

    name = "splash"

    custom_settings = {
        "USER_AGENT": "*",
        "LOG_LEVEL": "WARNING",
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        "SPIDER_MIDDLEWARES": {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100},
        "DUPEFILTER_CLASS": 'scrapy_splash.SplashAwareDupeFilter',
        "HTTPCACHE_STORAGE": 'scrapy_splash.SplashAwareFSCacheStorage',
    }

    httperror_allowed_codes = [301]

    start_urls = ["https://www.oddsportal.com/tennis/results/"]

    rules = (
        Rule(
            LinkExtractor(allow="/atp-buenos-aires/results/"),
            callback="parse_tournament",
            process_request="use_splash",
            follow=True,
        ),
        Rule(
            LinkExtractor(
                allow="/tennis/",
                restrict_xpaths=("//td[@class='name table-participant']//a"),
            ),
            callback="parse_match",
            process_request="use_splash",
        ),
    )

    def process_links(self, links):
        for link in links:
            link.url = "http://localhost:8050/render.html?" + urlencode({'url' : link.url})
        return links

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashTextResponse)):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def use_splash(self, request, response):
        request.meta.update(splash={'endpoint': 'render.html'})
        return request

    def parse_tournament(self, response: Response):
        logger.info(f"Parsing tournament - {response.url}")

    def parse_match(self, response: Response):
        logger.info(f"Parsing match - {response.url}")
However, this takes about the same amount of time. I was hoping to see a big increase in speed :(
I've tried playing around with different DOWNLOAD_DELAY settings, but that hasn't made things any faster.
All the concurrency settings are left at their defaults.
Any ideas on if/how I'm going wrong?
ANSWER
Answered 2022-Jan-22 at 16:39
Taking a stab at an answer here with no experience of the libraries.
It looks like Scrapy crawlers themselves are single-threaded. To get multi-threaded behavior you need to configure your application differently or write code that makes it behave multi-threaded. It sounds like you've already tried this, so this is probably not news to you, but make sure you have configured the CONCURRENT_REQUESTS and REACTOR_THREADPOOL_MAXSIZE settings.
https://docs.scrapy.org/en/latest/topics/settings.html?highlight=thread#reactor-threadpool-maxsize
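For reference, these are standard Scrapy settings and can be set per spider via custom_settings; the values below are purely illustrative, not tuned recommendations.

custom_settings = {
    "CONCURRENT_REQUESTS": 32,             # Scrapy default is 16
    "CONCURRENT_REQUESTS_PER_DOMAIN": 16,  # Scrapy default is 8
    "REACTOR_THREADPOOL_MAXSIZE": 20,      # Scrapy default is 10
    "DOWNLOAD_DELAY": 0,                   # a positive delay spaces out requests to the same domain
}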
I can't imagine there is much CPU work going on in the crawling process, so I doubt it's a GIL issue.
Excluding the GIL as an option, there are two possibilities here:
- Your crawler is not actually multi-threaded. This may be because you are missing some setup or configuration that would make it so. i.e. You may have set the env variables correctly but your crawler is written in a way that is processing requests for urls synchronously instead of submitting them to a queue.
To test this, create a global object and store a counter on it. Each time your crawler starts a request increment the counter. Each time your crawler finishes a request, decrement the counter. Then run a thread that prints the counter every second. If your counter value is always 1, then you are still running synchronously.
# global_state.py

GLOBAL_STATE = {"counter": 0}

# middleware.py

from global_state import GLOBAL_STATE

class SeleniumMiddleware:

    def process_request(self, request, spider):
        GLOBAL_STATE["counter"] += 1
        self.driver.get(request.url)
        GLOBAL_STATE["counter"] -= 1

    ...

# main.py

from global_state import GLOBAL_STATE
import threading
import time

def main():
    gst = threading.Thread(target=gs_watcher)
    gst.start()

    # Start your app here

def gs_watcher():
    while True:
        print(f"Concurrent requests: {GLOBAL_STATE['counter']}")
        time.sleep(1)
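If you would rather not touch the Selenium middleware itself, an alternative sketch (my addition, not part of the original answer) is to drive the same counter from Scrapy's signals, which the question's own SeleniumMiddleware already uses via crawler.signals.connect. The signals request_reached_downloader and response_received are real Scrapy signals; the extension class and module name below are placeholders you would adapt.

# signal_counter.py - count in-flight requests via Scrapy signals (sketch).
from scrapy import signals

from global_state import GLOBAL_STATE

class RequestCounterExtension:

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # request_reached_downloader / response_received roughly bracket the network work.
        crawler.signals.connect(ext.request_started, signal=signals.request_reached_downloader)
        crawler.signals.connect(ext.request_finished, signal=signals.response_received)
        return ext

    def request_started(self, request, spider):
        GLOBAL_STATE["counter"] += 1

    def request_finished(self, response, request, spider):
        GLOBAL_STATE["counter"] -= 1

Register it under the EXTENSIONS setting, e.g. {'myproject.signal_counter.RequestCounterExtension': 500}, where the module path is whatever fits your project layout.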
- The site you are crawling is rate limiting you.
To test this, run the application multiple times. If you go from 50 req/s to 25 req/s per application instance, you are being rate limited; to skirt around this, use a VPN to hop around. (A rough way to measure the request rate is sketched below.)
If, after that, you find that you are running concurrent requests and you are not being rate limited, then something funky is going on in the libraries. Try removing chunks of code until you get down to the bare minimum needed to crawl. Once you reach the absolute bare-minimum implementation and it is still slow, you have a minimal reproducible example and can get much better, more informed help.
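A rough sketch of that throughput measurement, reusing the global-counter idea from the snippet above (not part of the original answer; the names GLOBAL_STATE["finished"], rate_watcher and start_watcher are mine, and where you increment the counter depends on your own middleware):

# throughput_watcher.py - rough requests-per-second monitor (sketch).
# Increment GLOBAL_STATE["finished"] wherever a response completes in your code
# (for example in a downloader middleware's process_response), then compare the
# printed rate when running one instance versus several instances.
import threading
import time

GLOBAL_STATE = {"finished": 0}

def rate_watcher(interval: float = 1.0) -> None:
    last = 0
    while True:
        time.sleep(interval)
        now = GLOBAL_STATE["finished"]
        print(f"~{(now - last) / interval:.1f} req/s")
        last = now

def start_watcher() -> threading.Thread:
    # Daemon thread so it never blocks shutdown of the crawl process.
    t = threading.Thread(target=rate_watcher, daemon=True)
    t.start()
    return t

If the per-instance rate drops roughly in half when you double the number of instances, rate limiting is the likely culprit.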
QUESTION
How can I send Dynamic website content to scrapy with the html content generated by selenium browser?
Asked 2022-Jan-20 at 15:35
I am working on a stock-related project where I have to scrape all data on a daily basis for the last 5 years, i.e. from 2016 to date. I thought of using Selenium in particular because I can drive the crawler to fetch the data based on the date. So I used button clicks with Selenium, and now I want the same data that is displayed in the Selenium browser to be fed to Scrapy. This is the website I am working on right now. I have written the following code inside the Scrapy spider.
class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    def start_requests(self):

        driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())


        floorsheet_dates = ['01/03/2016','01/04/2016', up to till date '01/10/2022']

        for date in floorsheet_dates:
            driver.get(
                "https://merolagani.com/Floorsheet.aspx")

            driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                ).send_keys(date)
            driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = driver.find_element(By.XPATH,
                                               "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))
            for data in range(z, z + 1):
                driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.url = driver.page_source
                yield Request(url=self.url, callback=self.parse)


    def parse(self, response, **kwargs):
        for value in response.xpath('//tbody/tr'):
            print(value.css('td::text').extract()[1])
            print("ok"*200)
Update: the error after applying the answer is:
2022-01-14 14:11:36 [twisted] CRITICAL:
Traceback (most recent call last):
  File "/home/navaraj/PycharmProjects/first_scrapy/env/lib/python3.8/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "/home/navaraj/PycharmProjects/first_scrapy/env/lib/python3.8/site-packages/scrapy/crawler.py", line 88, in crawl
    start_requests = iter(self.spider.start_requests())
TypeError: 'NoneType' object is not iterable
I want to feed the current page's HTML content to Scrapy, but I have been getting this unusual error for the past 2 days. Any help or suggestions will be very much appreciated.
ANSWER
Answered 2022-Jan-14 at 09:30
The two solutions are not very different. Solution #2 fits your question better, but choose whichever you prefer.
Solution 1 - create a response with the HTML body from the driver and scrape it right away (you can also pass it as an argument to a function):
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from scrapy.http import HtmlResponse


class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    def start_requests(self):

        # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
        driver = webdriver.Chrome()

        floorsheet_dates = ['01/03/2016','01/04/2016']#, up to till date '01/10/2022']

        for date in floorsheet_dates:
            driver.get(
                "https://merolagani.com/Floorsheet.aspx")

            driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                ).send_keys(date)
            driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = driver.find_element(By.XPATH,
                                               "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))
            for data in range(1, z + 1):
                driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.body = driver.page_source

                response = HtmlResponse(url=driver.current_url, body=self.body, encoding='utf-8')
                for value in response.xpath('//tbody/tr'):
                    print(value.css('td::text').extract()[1])
                    print("ok"*200)

        # return an empty requests list
        return []
Solution 2 - with a super simple downloader middleware (you might see a delay before the parse method runs, so be patient):
import scrapy
from scrapy import Request
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By


class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        url = spider.driver.current_url
        body = spider.driver.page_source
        return HtmlResponse(url=url, body=body, encoding='utf-8', request=request)


class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'tempbuffer.spiders.yetanotherspider.SeleniumMiddleware': 543,
            # 'projects_name.path.to.your.pipeline': 543
        }
    }
    driver = webdriver.Chrome()

    def start_requests(self):

        # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())


        floorsheet_dates = ['01/03/2016','01/04/2016']#, up to till date '01/10/2022']

        for date in floorsheet_dates:
            self.driver.get(
                "https://merolagani.com/Floorsheet.aspx")

            self.driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                     ).send_keys(date)
            self.driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = self.driver.find_element(By.XPATH,
                                                    "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))
            for data in range(1, z + 1):
                self.driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.body = self.driver.page_source
                self.url = self.driver.current_url

                yield Request(url=self.url, callback=self.parse, dont_filter=True)

    def parse(self, response, **kwargs):
        print('test ok')
        for value in response.xpath('//tbody/tr'):
            print(value.css('td::text').extract()[1])
            print("ok"*200)
Notice that I've used Chrome, so change it back to Firefox as in your original code.
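For reference, a minimal sketch of that swap back to Firefox, mirroring the webdriver_manager call from the question's original code (the Selenium 4 Service variant is included as a comment in case executable_path is no longer accepted in your Selenium version):

from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager

# Selenium 3 style, matching the original spider's setup:
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())

# Selenium 4 style, if executable_path has been removed in your version:
# from selenium.webdriver.firefox.service import Service
# driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))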
QUESTION
How to set class variable through __init__ in Python?
Asked 2021-Nov-08 at 20:06
I am trying to change a setting from the command line while starting a Scrapy crawler (Python 3.7). Therefore I am adding an __init__ method, but I could not figure out how to change the class variable "delay" from within the __init__ method.
Minimal example:
class testSpider(CrawlSpider):

    custom_settings = {
        'DOWNLOAD_DELAY': 10, # default value
    }

    """ get arguments passed over CLI
        scrapyd usage: -d arg1=val1
        scrapy usage: -a arg1=val1
    """
    def __init__(self, *args, **kwargs):
        super(testSpider, self).__init__(*args, **kwargs)

        self.delay = kwargs.get('delay')

        if self.delay:
            testSpider.custom_settings['DOWNLOAD_DELAY'] = self.delay
            print('init:', testSpider.custom_settings['DOWNLOAD_DELAY'])

    print(custom_settings['DOWNLOAD_DELAY'])
Unfortunately, this will not change the setting:
scrapy crawl test -a delay=5
10
init: 5
How can the class variable be changed?
ANSWER
Answered 2021-Nov-08 at 20:06

(Quoting the question:) "I am trying to change setting from command line while starting a scrapy crawler (Python 3.7). Therefore I am adding a init method... scrapy crawl test -a delay=5"

According to the Scrapy docs (Settings / Command line options section), it is required to use the -s parameter to update a setting:

scrapy crawl test -s DOWNLOAD_DELAY=5

It is not possible to update settings at runtime from spider code (in __init__ or other methods); details are in the related GitHub discussion "Update spider settings during runtime #4196".
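A small sketch (mine, not the answer's) of how that looks end to end: override the value with -s when launching, and read the effective value back through self.settings inside the spider to confirm it was applied. The TestSpider class and example.com URL below are stand-ins for the question's testSpider; self.settings and Settings.getfloat are standard Scrapy APIs.

# Launch with the setting override:
#   scrapy crawl test -s DOWNLOAD_DELAY=5
# For scrapyd, the schedule endpoint takes a `setting` argument instead,
# e.g. -d setting=DOWNLOAD_DELAY=5 (check the docs of your scrapyd version).

import scrapy

class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = ["https://example.com"]

    def start_requests(self):
        # self.settings holds the effective (frozen) settings for this crawl,
        # so logging it here confirms whether the -s override took effect.
        self.logger.info("Effective DOWNLOAD_DELAY: %s",
                         self.settings.getfloat("DOWNLOAD_DELAY"))
        yield from super().start_requests()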
QUESTION
headless chrome on docker M1 error - unable to discover open window in chrome
Asked 2021-Nov-04 at 08:22
I'm currently trying to run headless Chrome with Selenium in an amd64 Ubuntu container on an M1 Mac host.
Because arm64 Ubuntu does not provide the google-chrome-stable package, I decided to use an amd64 Ubuntu base image.
But it does not work; I'm getting the errors below.
1worker_1 | [2021-10-31 03:58:23,286: DEBUG/ForkPoolWorker-10] POST http://localhost:43035/session {"capabilities": {"firstMatch": [{}], "alwaysMatch": {"browserName": "chrome", "pageLoadStrategy": "normal", "goog:chromeOptions": {"extensions": [], "args": ["--no-sandbox", "--disable-dev-shm-usage", "--disable-gpu", "--remote-debugging-port=9222", "--headless"]}}}, "desiredCapabilities": {"browserName": "chrome", "pageLoadStrategy": "normal", "goog:chromeOptions": {"extensions": [], "args": ["--no-sandbox", "--disable-dev-shm-usage", "--disable-gpu", "--remote-debugging-port=9222", "--headless"]}}}
2worker_1 | [2021-10-31 03:58:23,330: DEBUG/ForkPoolWorker-10] Starting new HTTP connection (1): localhost:43035
3worker_1 | [2021-10-31 03:58:41,311: DEBUG/ForkPoolWorker-12] http://localhost:47089 "POST /session HTTP/1.1" 500 717
4worker_1 | [2021-10-31 03:58:41,412: DEBUG/ForkPoolWorker-12] Finished Request
5worker_1 | [2021-10-31 03:58:41,825: WARNING/ForkPoolWorker-12] Error occurred while initializing chromedriver - Message: unknown error: unable to discover open window in chrome
6worker_1 | (Session info: headless chrome=95.0.4638.69)
7worker_1 | Stacktrace:
8worker_1 | #0 0x004000a18f93 <unknown>
9worker_1 | #1 0x0040004f3908 <unknown>
10worker_1 | #2 0x0040004d3cdf <unknown>
11worker_1 | #3 0x00400054cabe <unknown>
12worker_1 | #4 0x004000546973 <unknown>
13worker_1 | #5 0x00400051cdf4 <unknown>
14worker_1 | #6 0x00400051dde5 <unknown>
15worker_1 | #7 0x004000a482be <unknown>
16worker_1 | #8 0x004000a5dba0 <unknown>
17worker_1 | #9 0x004000a49215 <unknown>
18worker_1 | #10 0x004000a5efe8 <unknown>
19worker_1 | #11 0x004000a3d9db <unknown>
20worker_1 | #12 0x004000a7a218 <unknown>
21worker_1 | #13 0x004000a7a398 <unknown>
22worker_1 | #14 0x004000a956cd <unknown>
23worker_1 | #15 0x004002b29609 <unknown>
24worker_1 |
25worker_1 | [2021-10-31 03:58:41,826: WARNING/ForkPoolWorker-12]
26worker_1 |
27worker_1 | [2021-10-31 03:58:41,867: DEBUG/ForkPoolWorker-11] http://localhost:58147 "POST /session HTTP/1.1" 500 717
28worker_1 | [2021-10-31 03:58:41,907: DEBUG/ForkPoolWorker-11] Finished Request
29worker_1 | [2021-10-31 03:58:41,946: DEBUG/ForkPoolWorker-12] Using selector: EpollSelector
30worker_1 | [WDM] -
31worker_1 |
32worker_1 | [2021-10-31 03:58:41,962: INFO/ForkPoolWorker-12]
33worker_1 |
34worker_1 | [WDM] - ====== WebDriver manager ======
35worker_1 | [2021-10-31 03:58:41,971: INFO/ForkPoolWorker-12] ====== WebDriver manager ======
36worker_1 | [2021-10-31 03:58:42,112: WARNING/ForkPoolWorker-11] Error occurred while initializing chromedriver - Message: unknown error: unable to discover open window in chrome
37worker_1 | (Session info: headless chrome=95.0.4638.69)
38worker_1 | Stacktrace:
39worker_1 | #0 0x004000a18f93 <unknown>
40worker_1 | #1 0x0040004f3908 <unknown>
41worker_1 | #2 0x0040004d3cdf <unknown>
42worker_1 | #3 0x00400054cabe <unknown>
43worker_1 | #4 0x004000546973 <unknown>
44worker_1 | #5 0x00400051cdf4 <unknown>
45worker_1 | #6 0x00400051dde5 <unknown>
46worker_1 | #7 0x004000a482be <unknown>
47worker_1 | #8 0x004000a5dba0 <unknown>
48worker_1 | #9 0x004000a49215 <unknown>
49worker_1 | #10 0x004000a5efe8 <unknown>
50worker_1 | #11 0x004000a3d9db <unknown>
51worker_1 | #12 0x004000a7a218 <unknown>
52worker_1 | #13 0x004000a7a398 <unknown>
53worker_1 | #14 0x004000a956cd <unknown>
54worker_1 | #15 0x004002b29609 <unknown>
55worker_1 |
56worker_1 | [2021-10-31 03:58:42,113: WARNING/ForkPoolWorker-11]
57worker_1 |
58worker_1 | [2021-10-31 03:58:42,166: DEBUG/ForkPoolWorker-11] Using selector: EpollSelector
59worker_1 | [WDM] -
60worker_1 |
61worker_1 | [2021-10-31 03:58:42,169: INFO/ForkPoolWorker-11]
62worker_1 |
63worker_1 | [WDM] - ====== WebDriver manager ======
64worker_1 | [2021-10-31 03:58:42,170: INFO/ForkPoolWorker-11] ====== WebDriver manager ======
65worker_1 | [2021-10-31 03:58:42,702: DEBUG/ForkPoolWorker-9] http://localhost:51793 "POST /session HTTP/1.1" 500 866
66worker_1 | [2021-10-31 03:58:42,719: DEBUG/ForkPoolWorker-9] Finished Request
67worker_1 | [2021-10-31 03:58:42,986: WARNING/ForkPoolWorker-9] Error occurred while initializing chromedriver - Message: unknown error: Chrome failed to start: crashed.
68worker_1 | (chrome not reachable)
69worker_1 | (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
70worker_1 | Stacktrace:
71worker_1 | #0 0x004000a18f93 <unknown>
72worker_1 | #1 0x0040004f3908 <unknown>
73worker_1 | #2 0x004000516b32 <unknown>
74worker_1 | #3 0x00400051265d <unknown>
75worker_1 | #4 0x00400054c770 <unknown>
76worker_1 | #5 0x004000546973 <unknown>
77worker_1 | #6 0x00400051cdf4 <unknown>
78worker_1 | #7 0x00400051dde5 <unknown>
79worker_1 | #8 0x004000a482be <unknown>
80worker_1 | #9 0x004000a5dba0 <unknown>
81worker_1 | #10 0x004000a49215 <unknown>
82worker_1 | #11 0x004000a5efe8 <unknown>
83worker_1 | #12 0x004000a3d9db <unknown>
84worker_1 | #13 0x004000a7a218 <unknown>
85worker_1 | #14 0x004000a7a398 <unknown>
86worker_1 | #15 0x004000a956cd <unknown>
87worker_1 | #16 0x004002b29609 <unknown>
88worker_1 |
89worker_1 | [2021-10-31 03:58:42,987: WARNING/ForkPoolWorker-9]
90worker_1 |
91worker_1 | [2021-10-31 03:58:43,045: DEBUG/ForkPoolWorker-9] Using selector: EpollSelector
92worker_1 | [WDM] -
93worker_1 |
94worker_1 | [2021-10-31 03:58:43,049: INFO/ForkPoolWorker-9]
95worker_1 |
96worker_1 | [WDM] - ====== WebDriver manager ======
97worker_1 | [2021-10-31 03:58:43,050: INFO/ForkPoolWorker-9] ====== WebDriver manager ======
98worker_1 | [2021-10-31 03:58:43,936: DEBUG/ForkPoolWorker-10] http://localhost:43035 "POST /session HTTP/1.1" 500 866
99worker_1 | [2021-10-31 03:58:43,952: DEBUG/ForkPoolWorker-10] Finished Request
100worker_1 | [2021-10-31 03:58:44,163: WARNING/ForkPoolWorker-10] Error occurred while initializing chromedriver - Message: unknown error: Chrome failed to start: crashed.
101worker_1 | (chrome not reachable)
102worker_1 | (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
103worker_1 | Stacktrace:
104worker_1 | #0 0x004000a18f93 <unknown>
105worker_1 | #1 0x0040004f3908 <unknown>
106worker_1 | #2 0x004000516b32 <unknown>
107worker_1 | #3 0x00400051265d <unknown>
108worker_1 | #4 0x00400054c770 <unknown>
109worker_1 | #5 0x004000546973 <unknown>
110worker_1 | #6 0x00400051cdf4 <unknown>
111worker_1 | #7 0x00400051dde5 <unknown>
112worker_1 | #8 0x004000a482be <unknown>
113worker_1 | #9 0x004000a5dba0 <unknown>
114worker_1 | #10 0x004000a49215 <unknown>
115worker_1 | #11 0x004000a5efe8 <unknown>
116worker_1 | #12 0x004000a3d9db <unknown>
117worker_1 | #13 0x004000a7a218 <unknown>
118worker_1 | #14 0x004000a7a398 <unknown>
119worker_1 | #15 0x004000a956cd <unknown>
120worker_1 | #16 0x004002b29609 <unknown>
121worker_1 |
122worker_1 | [2021-10-31 03:58:44,164: WARNING/ForkPoolWorker-10]
123worker_1 |
124worker_1 | [2021-10-31 03:58:44,205: DEBUG/ForkPoolWorker-10] Using selector: EpollSelector
125worker_1 | [WDM] -
126worker_1 |
127worker_1 | [2021-10-31 03:58:44,215: INFO/ForkPoolWorker-10]
128worker_1 |
129worker_1 | [WDM] - ====== WebDriver manager ======
130worker_1 | [2021-10-31 03:58:44,217: INFO/ForkPoolWorker-10] ====== WebDriver manager ======
131worker_1 | [WDM] - Current google-chrome version is 95.0.4638
132worker_1 | [2021-10-31 03:58:44,520: INFO/ForkPoolWorker-12] Current google-chrome version is 95.0.4638
133worker_1 | [WDM] - Get LATEST driver version for 95.0.4638
134worker_1 | [2021-10-31 03:58:44,525: INFO/ForkPoolWorker-12] Get LATEST driver version for 95.0.4638
135worker_1 | [WDM] - Current google-chrome version is 95.0.4638
136worker_1 | [2021-10-31 03:58:44,590: INFO/ForkPoolWorker-11] Current google-chrome version is 95.0.4638
137worker_1 | [WDM] - Get LATEST driver version for 95.0.4638
138worker_1 | [2021-10-31 03:58:44,593: INFO/ForkPoolWorker-11] Get LATEST driver version for 95.0.4638
139worker_1 | [2021-10-31 03:58:44,599: DEBUG/ForkPoolWorker-12] Starting new HTTPS connection (1): chromedriver.storage.googleapis.com:443
140worker_1 | [2021-10-31 03:58:44,826: DEBUG/ForkPoolWorker-11] Starting new HTTPS connection (1): chromedriver.storage.googleapis.com:443
141worker_1 | [2021-10-31 03:58:45,205: DEBUG/ForkPoolWorker-11] https://chromedriver.storage.googleapis.com:443 "GET /LATEST_RELEASE_95.0.4638 HTTP/1.1" 200 12
142worker_1 | [2021-10-31 03:58:45,213: DEBUG/ForkPoolWorker-12] https://chromedriver.storage.googleapis.com:443 "GET /LATEST_RELEASE_95.0.4638 HTTP/1.1" 200 12
143worker_1 | [WDM] - Driver [/home/ubuntu/.wdm/drivers/chromedriver/linux64/95.0.4638.54/chromedriver] found in cache
144worker_1 | [2021-10-31 03:58:45,219: INFO/ForkPoolWorker-11] Driver [/home/ubuntu/.wdm/drivers/chromedriver/linux64/95.0.4638.54/chromedriver] found in cache
145worker_1 | [WDM] - Driver [/home/ubuntu/.wdm/drivers/chromedriver/linux64/95.0.4638.54/chromedriver] found in cache
146worker_1 | [2021-10-31 03:58:45,242: INFO/ForkPoolWorker-12] Driver [/home/ubuntu/.wdm/drivers/chromedriver/linux64/95.0.4638.54/chromedriver] found in cache
147worker_1 | [WDM] - Current google-chrome version is 95.0.4638
148worker_1 | [2021-10-31 03:58:45,603: INFO/ForkPoolWorker-9] Current google-chrome version is 95.0.4638
149worker_1 | [WDM] - Get LATEST driver version for 95.0.4638
150worker_1 | [2021-10-31 03:58:45,610: INFO/ForkPoolWorker-9] Get LATEST driver version for 95.0.4638
151
Similar logs keep looping.
When I try to launch Chrome manually inside the Docker container, this error occurs:
ubuntu@742a62c61201:/backend$ google-chrome --no-sandbox --disable-dev-shm-usage --disable-gpu --remote-debugging-port=9222 --headless
qemu: uncaught target signal 5 (Trace/breakpoint trap) - core dumped
qemu: uncaught target signal 5 (Trace/breakpoint trap) - core dumped
[1031/041139.297323:ERROR:bus.cc(392)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[1031/041139.310612:ERROR:file_path_watcher_linux.cc(326)] inotify_init() failed: Function not implemented (38)

DevTools listening on ws://127.0.0.1:9222/devtools/browser/32b15b93-3fe0-4cb8-9c96-8aea011686a8
qemu: unknown option 'type=utility'
[1031/041139.463057:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
[1031/041139.463227:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 1 time(s)
[1031/041139.543335:ERROR:network_service_instance_impl.cc(638)] Network service crashed, restarting service.
qemu: unknown option 'type=utility'
[1031/041139.718793:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
[1031/041139.718877:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 2 time(s)
[1031/041139.736641:ERROR:network_service_instance_impl.cc(638)] Network service crashed, restarting service.
qemu: unknown option 'type=utility'
[1031/041139.788529:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
[1031/041139.788615:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 3 time(s)
[1031/041139.798487:ERROR:network_service_instance_impl.cc(638)] Network service crashed, restarting service.
[1031/041139.808256:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
[1031/041139.808372:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 4 time(s)
qemu: unknown option 'type=utility'
[1031/041139.825267:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
[1031/041139.825354:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 5 time(s)
[1031/041139.830175:ERROR:network_service_instance_impl.cc(638)] Network service crashed, restarting service.
[1031/041139.839159:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
[1031/041139.839345:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 6 time(s)
[1031/041139.839816:FATAL:gpu_data_manager_impl_private.cc(417)] GPU process isn't usable. Goodbye.
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
Segmentation fault
ubuntu@742a62c61201:/backend$ qemu: unknown option 'type=utility'

ubuntu@742a62c61201:/backend$
Maybe this issue is related? https://github.com/docker/for-mac/issues/5766
If so, is there no way to dockerize headless Chrome on an M1?
Celery worker Dockerfile:
FROM --platform=linux/amd64 ubuntu:20.04

ENV DEBIAN_FRONTEND noninteractive

RUN apt update -y && apt install python3.9 python3-pip python-is-python3 sudo wget -y

RUN pip install --upgrade pip

# set environment variables
ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1

RUN adduser --disabled-password --gecos '' ubuntu
RUN adduser ubuntu sudo
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers

USER ubuntu

RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
RUN echo "deb http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google.list
RUN sudo apt update -y && sudo apt install -y google-chrome-stable

ENV PATH="/home/ubuntu/.local/bin:$PATH"

WORKDIR /backend

COPY requirements.txt ./

RUN pip install -r requirements.txt --no-cache-dir

COPY . .

ENV DISPLAY=:99

ENTRYPOINT [ "./run-celery.sh" ]
docker-compose.yml
142worker_1 | [2021-10-31 03:58:45,213: DEBUG/ForkPoolWorker-12] https://chromedriver.storage.googleapis.com:443 "GET /LATEST_RELEASE_95.0.4638 HTTP/1.1" 200 12
143worker_1 | [WDM] - Driver [/home/ubuntu/.wdm/drivers/chromedriver/linux64/95.0.4638.54/chromedriver] found in cache
144worker_1 | [2021-10-31 03:58:45,219: INFO/ForkPoolWorker-11] Driver [/home/ubuntu/.wdm/drivers/chromedriver/linux64/95.0.4638.54/chromedriver] found in cache
145worker_1 | [WDM] - Driver [/home/ubuntu/.wdm/drivers/chromedriver/linux64/95.0.4638.54/chromedriver] found in cache
146worker_1 | [2021-10-31 03:58:45,242: INFO/ForkPoolWorker-12] Driver [/home/ubuntu/.wdm/drivers/chromedriver/linux64/95.0.4638.54/chromedriver] found in cache
147worker_1 | [WDM] - Current google-chrome version is 95.0.4638
148worker_1 | [2021-10-31 03:58:45,603: INFO/ForkPoolWorker-9] Current google-chrome version is 95.0.4638
149worker_1 | [WDM] - Get LATEST driver version for 95.0.4638
150worker_1 | [2021-10-31 03:58:45,610: INFO/ForkPoolWorker-9] Get LATEST driver version for 95.0.4638
151ubuntu@742a62c61201:/backend$ google-chrome --no-sandbox --disable-dev-shm-usage --disable-gpu --remote-debugging-port=9222 --headless
152qemu: uncaught target signal 5 (Trace/breakpoint trap) - core dumped
153qemu: uncaught target signal 5 (Trace/breakpoint trap) - core dumped
154[1031/041139.297323:ERROR:bus.cc(392)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
155[1031/041139.310612:ERROR:file_path_watcher_linux.cc(326)] inotify_init() failed: Function not implemented (38)
156
157DevTools listening on ws://127.0.0.1:9222/devtools/browser/32b15b93-3fe0-4cb8-9c96-8aea011686a8
158qemu: unknown option 'type=utility'
159[1031/041139.463057:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
160[1031/041139.463227:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 1 time(s)
161[1031/041139.543335:ERROR:network_service_instance_impl.cc(638)] Network service crashed, restarting service.
162qemu: unknown option 'type=utility'
163[1031/041139.718793:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
164[1031/041139.718877:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 2 time(s)
165[1031/041139.736641:ERROR:network_service_instance_impl.cc(638)] Network service crashed, restarting service.
166qemu: unknown option 'type=utility'
167[1031/041139.788529:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
168[1031/041139.788615:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 3 time(s)
169[1031/041139.798487:ERROR:network_service_instance_impl.cc(638)] Network service crashed, restarting service.
170[1031/041139.808256:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
171[1031/041139.808372:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 4 time(s)
172qemu: unknown option 'type=utility'
173[1031/041139.825267:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
174[1031/041139.825354:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 5 time(s)
175[1031/041139.830175:ERROR:network_service_instance_impl.cc(638)] Network service crashed, restarting service.
176[1031/041139.839159:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
177[1031/041139.839345:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 6 time(s)
178[1031/041139.839816:FATAL:gpu_data_manager_impl_private.cc(417)] GPU process isn't usable. Goodbye.
179qemu: uncaught target signal 11 (Segmentation fault) - core dumped
180Segmentation fault
181ubuntu@742a62c61201:/backend$ qemu: unknown option 'type=utility'
182
183ubuntu@742a62c61201:/backend$
# Dockerfile for the Celery worker image (its ENTRYPOINT runs ./run-celery.sh)
FROM --platform=linux/amd64 ubuntu:20.04

ENV DEBIAN_FRONTEND noninteractive

RUN apt update -y && apt install python3.9 python3-pip python-is-python3 sudo wget -y

RUN pip install --upgrade pip

# set environment variables
ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1

RUN adduser --disabled-password --gecos '' ubuntu
RUN adduser ubuntu sudo
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers

USER ubuntu

RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
RUN echo "deb http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google.list
RUN sudo apt update -y && sudo apt install -y google-chrome-stable

ENV PATH="/home/ubuntu/.local/bin:$PATH"

WORKDIR /backend

COPY requirements.txt ./

RUN pip install -r requirements.txt --no-cache-dir

COPY . .

ENV DISPLAY=:99

ENTRYPOINT [ "./run-celery.sh" ]
# docker-compose.yml wiring together the frontend, backend, Celery worker, and Redis services
version: "3.3"

services:
  frontend:
    build:
      context: ./frontend
    ports:
      - "3000:3000"
    volumes:
      - ./frontend:/frontend
    depends_on:
      - backend
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 4G
        reservations:
          cpus: "0.5"
          memory: 512M
    tty: true
    stdin_open: true

  backend:
    build: ./backend
    ports:
      - "8000:8000"
    volumes:
      - ./backend:/backend
    networks:
      - redis-network
    depends_on:
      - redis
      - worker
    environment:
      - is_docker=1
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 4G
        reservations:
          cpus: "0.5"
          memory: 512M
    tty: true

  worker:
    build:
      context: ./backend
      dockerfile: ./celery-dockerfile/Dockerfile
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 4G
        reservations:
          cpus: "0.5"
          memory: 4G
    volumes:
      - ./backend:/backend
    networks:
      - redis-network
    depends_on:
      - redis
    environment:
      - is_docker=1
    privileged: true
    tty: true
    platform: linux/amd64

  redis:
    image: redis:alpine
    command: redis-server --port 6379
    container_name: redis_server
    hostname: redis_server
    labels:
      - "name=redis"
      - "mode=standalone"
    networks:
      - redis-network
    expose:
      - "6379"
    tty: true

networks:
  redis-network:
The crawler's full code comes from the AutoCrawler repository; if you want the complete crawler code, it is best to check out that repository.
I've changed the Chrome options through trial and error; the relevant snippet is below.
from selenium.webdriver.chrome.options import Options  # import implied by the snippet

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("--remote-debugging-port=9222")
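For context, a minimal sketch (an assumption, not code from the question) of how options like these are usually handed to chromedriver through webdriver_manager, which the [WDM] lines in the log come from; the Selenium 4 Service style, the --headless flag (taken from the session capabilities in the log), and the example URL are assumptions:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

chrome_options = Options()
for arg in ('--no-sandbox', '--disable-dev-shm-usage', '--disable-gpu',
            '--remote-debugging-port=9222', '--headless'):
    chrome_options.add_argument(arg)

# In an amd64 container emulated on an Apple Silicon host, this call is what fails
# with the "unable to discover open window" / "Chrome failed to start" errors above.
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=chrome_options,
)
driver.get('https://example.com')
driver.quit()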
ANSWER
Answered 2021-Nov-01 at 05:10
I think there is no way to use Chrome/Chromium in Docker on an M1 host:
- there is no Chrome binary for arm64 Linux
- Chrome in an amd64 container crashes when the host is an M1 machine - see the Docker docs
- Chromium can be installed via snap, but the snap service does not run inside Docker (and without snap you get exit code 127, because the binary installed from apt is an empty stub) - see the issue report
Chromium does support arm64 Ubuntu, so I tried Chromium instead of Chrome.
But chromedriver does not officially support arm64, so I used an unofficial binary from an Electron release: https://stackoverflow.com/a/57586200/11853111
Bypassing
Finally, I decided to use geckodriver and Firefox when running in Docker.
They work seamlessly regardless of the host/container architecture.
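For reference, a minimal sketch of that geckodriver/Firefox fallback, assuming Selenium 4 and webdriver_manager; the example URL and the headless option are illustrative, not taken from the original answer:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager

firefox_options = Options()
firefox_options.add_argument('--headless')  # no display inside the container

# geckodriver publishes arm64 Linux builds, so this works on amd64 and arm64 images alike.
driver = webdriver.Firefox(
    service=Service(GeckoDriverManager().install()),
    options=firefox_options,
)
driver.get('https://example.com')
print(driver.title)
driver.quit()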
QUESTION
How do I pass in arguments non-interactive into a bash file that uses "read"?
Asked 2021-Oct-27 at 02:58
I have the following shell script:
#! /bin/bash

. shell_functions/commonShellFunctions.sh
. shell_functions/settings.sh
_valid_url=1

echo "Welcome to HATS Accessibility Testing Tool!"
echo "We recommend using Chrome browser for the best experience."
echo "What would you like to scan today?"

options=("sitemap file containing links" "website")

select opt in "${options[@]}"
do
    case $opt in

    "sitemap file containing links")
        scanType="sitemap"
        crawler=crawlSitemap
        prompt_message="Please enter URL to sitemap: "
        break;;

    "website")
        prompt_website
        break;;

    "exit")
        exit;;

    *)
        echo "Invalid option $REPLY";;

    esac
done

read -p "$prompt_message" page

echo $page
It was meant to prompt the user; however, I want to use the script in a CI setting where I pass the arguments in without any prompting.
I'm currently using echo "<ARG1>\n<ARG2>" | bash run.sh, but I'm wondering if there's a better way to do this.
ANSWER
Answered 2021-Oct-27 at 02:58
Use a here-document:
./run.sh <<EOF
arg1
arg2
EOF
QUESTION
Scrapy crawls duplicate data
Asked 2021-Oct-26 at 12:51
Unfortunately, I currently have a problem with Scrapy. I am still new to Scrapy and would like to scrape information on Rolex watches. I started with the site Watch.de, where I first go through the Rolex listing and want to open the individual watches to get the exact information. However, when I start the crawler I see that many watches are crawled several times. I assume that these are the watches from the "Recently viewed" and "Our new arrivals" sections. Is there a way to ignore these duplicates?
That's my code:
import scrapy


class WatchbotSpider(scrapy.Spider):
    name = 'watchbot'
    start_urls = ['https://www.watch.de/germany/rolex.html']

    def parse(self, response, **kwargs):
        for link in response.css('div.product-item-link a::attr(href)'):
            yield response.follow(link.get(), callback=self.parse_categories)

    def parse_categories(self, response):
        for product in response.css('div.product-item-link'):
            yield {
                'id': product.css('span.product-item-id.product-item-ref::text').get(),
                'brand': product.css('div.product-item-brand::text').get(),
                'model': product.css('div.product-item-model::text').get(),
                'price': product.css('span.price::text').get(),
                'year': product.css('span.product-item-id.product-item-year::text').get(),
            }
ANSWER
Answered 2021-Oct-26 at 12:50
This works:
import scrapy


class WatchbotSpider(scrapy.Spider):

    name = 'watchbot'
    start_urls = ['https://www.watch.de/germany/rolex.html']

    def parse(self, response, **kwargs):
        for link in response.css('div.product-item-link a::attr(href)'):
            yield response.follow(link.get(), callback=self.parse_categories)

    def parse_categories(self, response):
        Dict = {
            'id': response.xpath('//div[@class="product-ref-item product-ref d-flex align-items-center"]/span/text()').get(),
            'brand': response.css('div.product-item-brand::text').get(),
            'model': response.xpath('//h1[@class="product-name"]/text()').get(),
            'price': response.css('span.price::text').get().replace(u'\xa0', u' '),
            'year': response.xpath('//div[@class="product-item-date product-item-option"]/span/text()').get(),
        }

        print(Dict)
        yield Dict
scrapy crawl watchbot > log
In the log:
{'id': '278240', 'brand': 'Rolex ', 'model': 'Rolex Datejust - Edelstahl - Armband Edelstahl / Oyster - 31mm - Ungetragen ', 'price': '8.118 €', 'year': '2021'}
{'id': '116201', 'brand': 'Rolex', 'model': 'Rolex Datejust - Stahl / Roségold - Armband Stahl / Roségold / Oyster - 36mm - Wie neu ', 'price': '14.545 €', 'year': '2018'}
{'id': '126622', 'brand': 'Rolex', 'model': 'Rolex Yacht-Master - Stahl / Platin - Armband Edelstahl / Oyster - 40mm - Ungetragen ', 'price': '15.995 €', 'year': '2020'}
{'id': '124300', 'brand': 'Rolex', 'model': 'Rolex Oyster Perpetual - Edelstahl - Armband Edelstahl / Oyster - 41mm - Ungetragen ', 'price': '9.898 €', 'year': '2021'}
{'id': '116500LN', 'brand': 'Rolex', 'model': 'Rolex Daytona - Edelstahl - Armband Edelstahl / Oyster - 40mm - Wie neu ', 'price': '33.999 €', 'year': '2020'}
{'id': '115234', 'brand': 'Rolex', 'model': 'Rolex Oyster Perpetual Date Diamanten - Stahl / Weißgold - Armband Edelstahl / Oyster - 34mm - Ungetragen - Vintage ', 'price': '11.990 €', 'year': '2021'}
{'id': '126200', 'brand': 'Rolex', 'model': 'Rolex Datejust - Edelstahl - Armband Edelstahl / Jubilé - 36mm - Ungetragen ', 'price': '9.595 €', 'year': '2021'}
{'id': '126333 ', 'brand': 'Rolex', 'model': 'Rolex Datejust - Stahl / Gelbgold - Armband Stahl / Gelbgold / Jubilé - 41mm - Wie neu ', 'price': '15.959 €', 'year': '2021'}
{'id': '126334 ', 'brand': 'Rolex', 'model': 'Rolex Datejust Wimbledon - Stahl / Weißgold - Armband Edelstahl / Oyster - 41mm - Ungetragen ', 'price': '13.399 €', 'year': '2021'}
{'id': '278240', 'brand': 'Rolex', 'model': 'Rolex Datejust - Edelstahl - Armband Edelstahl / Oyster - 31mm - Ungetragen ', 'price': '8.118 €', 'year': '2021'}
.
.
.
Formatting with replace(" ", "") will cause some exceptions, so careful formatting is the next step.
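To come back to the original question about duplicates (the items repeated by the "Recently viewed" and "Our new arrivals" blocks), one common option is a small item pipeline that drops any item whose reference number has already been seen. This is a sketch, not part of the answer above; it assumes the 'id' field uniquely identifies a watch, and the module path in the settings example is a placeholder:

from scrapy.exceptions import DropItem


class DuplicateWatchPipeline:
    """Drop items whose 'id' (the watch reference) was already yielded in this crawl."""

    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        ref = item.get('id')
        if ref and ref in self.seen_ids:
            raise DropItem(f"Duplicate watch skipped: {ref}")
        self.seen_ids.add(ref)
        return item

Enable it in settings.py with something like ITEM_PIPELINES = {'myproject.pipelines.DuplicateWatchPipeline': 300}. Scrapy's built-in request dupefilter already prevents the same detail URL from being fetched twice in one crawl, so duplicates typically come from the same watch appearing in several page blocks, which this pipeline filters at the item level.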
QUESTION
AWS Glue Crawler sends all data to Glue Catalog and Athena without Glue Job
Asked 2021-Oct-08 at 14:53
I am new to AWS Glue. I am using an AWS Glue Crawler to crawl data from two S3 buckets, with one file in each bucket. The crawler creates two tables in the AWS Glue Data Catalog, and I am also able to query the data in AWS Athena.
My understanding was that in order to get data into Athena I would need to create a Glue job to pull the data in, but I was wrong. Is it correct to say that the Glue crawler makes data queryable in Athena without a Glue job, and that a Glue job is only needed if we want to push the data into a database such as SQL Server, Oracle, etc.?
How can I configure the Glue Crawler so that it fetches only the delta data, and not all of the data, from the source bucket each time?
Any help is appreciated.
ANSWER
Answered 2021-Oct-08 at 14:53
The Glue crawler is only used to identify the schema that your data is in. Your data sits somewhere (e.g. S3), and the crawler identifies the schema by going through a percentage of your files.
You can then use a query engine like Athena (managed, serverless Apache Presto) to query the data, since it already has a schema.
If you want to process, clean, or aggregate the data, you can use Glue Jobs, which are essentially managed, serverless Spark.
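To make that division of labour concrete, here is a hedged boto3 sketch; the crawler name, database, table, and S3 result location are placeholders, and the RecrawlPolicy shown (which limits an S3-target crawler to folders added since the last crawl, one way to approach the delta question) is an assumption about your setup, not something stated in the answer:

import boto3

glue = boto3.client('glue')
athena = boto3.client('athena')

# Re-run the crawler; with CRAWL_NEW_FOLDERS_ONLY it only visits folders added
# since the last crawl instead of re-reading everything (placeholder names).
glue.update_crawler(
    Name='my-crawler',
    RecrawlPolicy={'RecrawlBehavior': 'CRAWL_NEW_FOLDERS_ONLY'},
)
glue.start_crawler(Name='my-crawler')

# Query the catalog table directly with Athena - no Glue job involved.
athena.start_query_execution(
    QueryString='SELECT * FROM my_table LIMIT 10',
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'},
)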
Community Discussions contain sources that include Stack Exchange Network
Tutorials and Learning Resources in Crawler
Tutorials and Learning Resources are not available at this moment for Crawler