by TableUV C Version: Current License: GPL-3.0
A Palm-sized Autonomous Table-top Sanitization Robot, website: https://tableuv.github.io/
QUESTION
How to test form submission with wrong values using Symfony crawler component and PHPUnit?
Asked 2022-Apr-05 at 11:18
When you use the app through the browser and send a bad value, the system checks the form for errors, and if something goes wrong (it does in this case) it redirects back with a default error message written below the offending field.
This is the behaviour I am trying to assert in my test case, but I came across an \InvalidArgumentException I was not expecting.
I am using the symfony/phpunit-bridge with phpunit/phpunit v8.5.23 and symfony/dom-crawler v5.3.7. Here's a sample of what it looks like:
public function testPayloadNotRespectingFieldLimits(): void
{
$client = static::createClient();
/** @var SomeRepository $repo */
$repo = self::getContainer()->get(SomeRepository::class);
$countEntries = $repo->count([]);
$crawler = $client->request(
'GET',
'/route/to/form/add'
);
$this->assertResponseIsSuccessful(); // Goes ok.
$form = $crawler->filter('[type=submit]')->form(); // It does retrieve my form node.
// This is where it's not working.
$form->setValues([
'some[name]' => 'Someokvalue',
'some[color]' => 'SomeNOTOKValue', // It is a ChoiceType with limited values, where 'SomeNOTOKValue' does not belong. This is the line that throws an \InvalidArgumentException.
]);
// What I'd like to assert after this
$client->submit($form);
$this->assertResponseRedirects();
$this->assertEquals($countEntries, $repo->count([]));
}
Here's the exception message I get :
InvalidArgumentException: Input "some[color]" cannot take "SomeNOTOKValue" as a value (possible values: "red", "pink", "purple", "white").
vendor/symfony/dom-crawler/Field/ChoiceFormField.php:140
vendor/symfony/dom-crawler/FormFieldRegistry.php:113
vendor/symfony/dom-crawler/Form.php:75
The ColorChoiceType tested here is pretty standard :
public function configureOptions(OptionsResolver $resolver): void
{
$resolver->setDefaults([
'choices' => ColorEnumType::getChoices(),
'multiple' => false,
]);
}
What I can do is wrap the line that sets the wrong value in a try-catch block. The form is then submitted and the test proceeds to the next assertion. The issue is that the form is considered submitted and valid: an appropriate value is forced onto the color field (the first choice of the enum set). This is not what I get when I try this in my browser (cf. the intro).
// ...
/** @var SomeRepository $repo */
$repo = self::getContainer()->get(SomeRepository::class);
$countEntries = $repo->count([]); // Gives 0.
// ...
try {
$form->setValues([
'some[name]' => 'Someokvalue',
'some[color]' => 'SomeNOTOKValue',
]);
} catch (\InvalidArgumentException $e) {}
$client->submit($form); // Now it submits the form.
$this->assertResponseRedirects(); // Ok.
$this->assertEquals($countEntries, $repo->count([])); // Failed asserting that 1 matches expected 0. !!
How can I mimic the browser behaviour in my test case and make assertions on it?
ANSWER
Answered 2022-Apr-05 at 11:17
It seems that you can disable validation on the DomCrawler\Form component, based on the official documentation here.
Doing this, it now works as expected:
$form = $crawler->filter('[type=submit]')->form()->disableValidation();
$form->setValues([
'some[name]' => 'Someokvalue',
'some[color]' => 'SomeNOTOKValue',
]);
$client->submit($form);
$this->assertEquals($countEntries, $repo->count([])); // Now passes.
QUESTION
Setting proxies when crawling websites with Python
Asked 2022-Mar-12 at 18:30
I want to set proxies for my crawler. I'm using the requests module and Beautiful Soup. I have found a list of API links that provide free proxies with 4 types of protocols.
Proxies for three of the four protocols (HTTP, SOCKS4, SOCKS5) work; the exception is proxies with the HTTPS protocol. This is my code:
from bs4 import BeautifulSoup
import requests
import random
import json
# LIST OF FREE PROXY APIS, THESE PROXIES ARE LAST TIME TESTED 50 MINUTES AGO, PROTOCOLS: HTTP, HTTPS, SOCKS4 AND SOCKS5
list_of_proxy_content = ["https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=CH&protocols=http%2Chttps%2Csocks4%2Csocks5",
"https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=FR&protocols=http%2Chttps%2Csocks4%2Csocks5",
"https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=DE&protocols=http%2Chttps%2Csocks4%2Csocks5",
"https://proxylist.geonode.com/api/proxy-list?limit=1500&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=AT&protocols=http%2Chttps%2Csocks4%2Csocks5",
"https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=IT&protocols=http%2Chttps%2Csocks4%2Csocks5"]
# EXTRACTING JSON DATA FROM THIS LIST OF PROXIES
full_proxy_list = []
for proxy_url in list_of_proxy_content:
proxy_json = requests.get(proxy_url).text
proxy_json = json.loads(proxy_json)
proxy_json = proxy_json["data"]
full_proxy_list.extend(proxy_json)
# CREATING PROXY DICT
final_proxy_list = []
for proxy in full_proxy_list:
#print(proxy) # JSON VALUE FOR ALL DATA THAT GOES INTO PROXY
protocol = proxy['protocols'][0]
ip_ = proxy['ip']
port = proxy['port']
proxy = {protocol : protocol + '://' + ip_ + ':' + port}
final_proxy_list.append(proxy)
# TRYING PROXY ON 3 DIFERENT WEBSITES
for proxy in final_proxy_list:
print(proxy)
try:
r0 = requests.get("https://edition.cnn.com/", proxies=proxy, timeout = 15)
if r0.status_code == 200:
print("GOOD PROXY")
else:
print("BAD PROXY")
except:
print("proxy error")
try:
r1 = requests.get("https://www.buelach.ch/", proxies=proxy, timeout = 15)
if r1.status_code == 200:
print("GOOD PROXY")
else:
print("BAD PROXY")
except:
print("proxy error")
try:
r2 = requests.get("https://www.blog.police.be.ch/", proxies=proxy, timeout = 15)
if r2.status_code == 200:
print("GOOD PROXY")
else:
print("BAD PROXY")
except:
print("proxy error")
print()
My question is: why do the HTTPS proxies not work? What am I doing wrong?
My proxies look like this:
{'socks4': 'socks4://185.168.173.35:5678'}
{'http': 'http://62.171.177.80:3128'}
{'https': 'http://159.89.28.169:3128'}
I have seen that sometimes people pass proxies like this:
proxies = {"http": "http://10.10.1.10:3128",
"https": "http://10.10.1.10:1080"}
But this dict has two protocols, while each of my entries has only one (e.g. just http). Why? Can I pass only one, and can I pass 10 different IP addresses in this dict?
ANSWER
Answered 2021-Sep-17 at 16:08
I did some research on the topic and now I'm confused why you want a proxy for HTTPS.
While it is understandable to want a proxy for HTTP (HTTP is unencrypted), HTTPS is secure.
Could it be possible your proxy is not connecting because you don't need one?
I am not a proxy expert, so I apologize if I'm putting out something completely stupid.
I don't want to leave you completely empty-handed though. If you are looking for complete privacy, I would suggest a VPN. Both Windscribe and RiseUpVPN are free and encrypt all your data on your computer. (The desktop version, not the browser extension.)
While this is not a fully automated process, it is still very effective.
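For reference, here is a minimal sketch (with placeholder IP addresses) of how requests selects a proxy: the keys of the proxies dict are matched against the scheme of the URL being fetched, so an "https" key is needed for https:// targets, and its value can itself be an http:// proxy endpoint used to tunnel the connection.
import requests
# Placeholder proxy endpoints -- substitute entries from your own proxy list.
proxies = {
    "http": "http://10.10.1.10:3128",   # used for http:// target URLs
    "https": "http://10.10.1.10:1080",  # used for https:// target URLs (CONNECT tunnel)
}
# Only the "https" entry is consulted here, because the target URL is https://.
r = requests.get("https://edition.cnn.com/", proxies=proxies, timeout=15)
print(r.status_code)
You can pass a single entry if you only ever fetch one scheme; to rotate several IPs you would build one such dict per proxy and pick one per request.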
QUESTION
Can't Successfully Run AWS Glue Job That Reads From DynamoDB
Asked 2022-Feb-07 at 10:49
I have successfully run crawlers that read my table in DynamoDB and also in AWS Redshift. The tables are now in the catalog. My problem is when running the Glue job that reads the data from DynamoDB into Redshift. It doesn't seem to be able to read from DynamoDB. The error logs contain this:
2022-02-01 10:16:55,821 WARN [task-result-getter-0] scheduler.TaskSetManager (Logging.scala:logWarning(69)): Lost task 0.0 in stage 0.0 (TID 0) (172.31.74.37 executor 1): java.lang.RuntimeException: Could not lookup table <TABLE-NAME> in DynamoDB.
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:143)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.getThroughput(ReadIopsCalculator.java:67)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.calculateTargetIops(ReadIopsCalculator.java:58)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.initReadManager(AbstractDynamoDBRecordReader.java:152)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.<init>(AbstractDynamoDBRecordReader.java:84)
at org.apache.hadoop.dynamodb.read.DefaultDynamoDBRecordReader.<init>(DefaultDynamoDBRecordReader.java:24)
at org.apache.hadoop.dynamodb.read.DynamoDBInputFormat.getRecordReader(DynamoDBInputFormat.java:32)
at com.amazonaws.services.glue.connections.DynamoConnection.getReader(DynamoConnection.scala:136)
at com.amazonaws.services.glue.DynamicRecordRDD.compute(DataSource.scala:610)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to dynamodb.us-east-1.amazonaws.com:443 [dynamodb.us-east-1.amazonaws.com/3.218.180.106] failed: connect timed out
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:120)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:132)
... 23 more
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to dynamodb.us-east-1.amazonaws.com:443 [dynamodb.us-east-1.amazonaws.com/3.218.180.106] failed: connect timed out
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1207)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1153)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:6164)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:6131)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeDescribeTable(AmazonDynamoDBClient.java:2228)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.describeTable(AmazonDynamoDBClient.java:2193)
at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:136)
at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:133)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:80)
... 24 more
Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to dynamodb.us-east-1.amazonaws.com:443 [dynamodb.us-east-1.amazonaws.com/3.218.180.106] failed: connect timed out
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
at com.amazonaws.http.conn.$Proxy20.connect(Unknown Source)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1331)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
... 38 more
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:368)
at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:142)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
... 53 more
and the complete logs contain this:
22/02/01 10:06:07 INFO GlueContext: Glue secret manager integration: secretId is not provided.
The role that Glue has been given has administrator access.
Below is the code for the script:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
database="db",
table_name="db_s3_table",
transformation_ctx="S3bucket_node1",
)
# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
frame=S3bucket_node1,
mappings=[
("column1.s", "string", "column1", "string"),
("column2.n", "string", "column2", "long"),
("column3.s", "string", "column3", "string"),
("partition_0", "string", "partition0", "string"),
],
transformation_ctx="ApplyMapping_node2",
)
# Script generated for node Redshift Cluster
RedshiftCluster_node3 = glueContext.write_dynamic_frame.from_catalog(
frame=ApplyMapping_node2,
database="db",
table_name="db_redshift_db_schema_table",
redshift_tmp_dir=args["TempDir"],
transformation_ctx="RedshiftCluster_node3",
)
job.commit()
ANSWER
Answered 2022-Feb-07 at 10:49
It seems that you were missing a VPC Endpoint for DynamoDB, since your Glue jobs run in a private VPC when you write to Redshift.
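As an illustration, here is a minimal boto3 sketch of creating a gateway VPC endpoint for DynamoDB so the private Glue subnets can reach the service (the VPC and route table IDs are placeholders, not values from the question):
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder IDs -- substitute the VPC and route tables used by your Glue connection.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.dynamodb",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
print(response["VpcEndpoint"]["VpcEndpointId"])
The same can be done from the console under VPC > Endpoints; what matters is that the endpoint is attached to the route tables of the subnets the Glue job runs in.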
QUESTION
Why does scrapy_splash CrawlSpider take the same amount of time as scrapy with Selenium?
Asked 2022-Jan-22 at 16:39
I have the following scrapy CrawlSpider:
import logger as lg
from scrapy.crawler import CrawlerProcess
from scrapy.http import Response
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashTextResponse
from urllib.parse import urlencode
from scrapy.linkextractors import LinkExtractor
from scrapy.http import HtmlResponse
logger = lg.get_logger("oddsportal_spider")
class SeleniumScraper(CrawlSpider):
name = "splash"
custom_settings = {
"USER_AGENT": "*",
"LOG_LEVEL": "WARNING",
"DOWNLOADER_MIDDLEWARES": {
'scraper_scrapy.odds.middlewares.SeleniumMiddleware': 543,
},
}
httperror_allowed_codes = [301]
start_urls = ["https://www.oddsportal.com/tennis/results/"]
rules = (
Rule(
LinkExtractor(allow="/atp-buenos-aires/results/"),
callback="parse_tournament",
follow=True,
),
Rule(
LinkExtractor(
allow="/tennis/",
restrict_xpaths=("//td[@class='name table-participant']//a"),
),
callback="parse_match",
),
)
def parse_tournament(self, response: Response):
logger.info(f"Parsing tournament - {response.url}")
def parse_match(self, response: Response):
logger.info(f"Parsing match - {response.url}")
process = CrawlerProcess()
process.crawl(SeleniumScraper)
process.start()
The Selenium middleware is as follows:
class SeleniumMiddleware:
@classmethod
def from_crawler(cls, crawler):
middleware = cls()
crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
return middleware
def process_request(self, request, spider):
logger.debug(f"Selenium processing request - {request.url}")
self.driver.get(request.url)
return HtmlResponse(
request.url,
body=self.driver.page_source,
encoding='utf-8',
request=request,
)
def spider_opened(self, spider):
options = webdriver.FirefoxOptions()
options.add_argument("--headless")
self.driver = webdriver.Firefox(
options=options,
executable_path=Path("/opt/geckodriver/geckodriver"),
)
def spider_closed(self, spider):
self.driver.close()
End to end this takes around a minute for around 50ish pages. To try and speed things up and take advantage of multiple threads and Javascript I've implemented the following scrapy_splash spider:
class SplashScraper(CrawlSpider):
name = "splash"
custom_settings = {
"USER_AGENT": "*",
"LOG_LEVEL": "WARNING",
"SPLASH_URL": "http://localhost:8050",
"DOWNLOADER_MIDDLEWARES": {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
},
"SPIDER_MIDDLEWARES": {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100},
"DUPEFILTER_CLASS": 'scrapy_splash.SplashAwareDupeFilter',
"HTTPCACHE_STORAGE": 'scrapy_splash.SplashAwareFSCacheStorage',
}
httperror_allowed_codes = [301]
start_urls = ["https://www.oddsportal.com/tennis/results/"]
rules = (
Rule(
LinkExtractor(allow="/atp-buenos-aires/results/"),
callback="parse_tournament",
process_request="use_splash",
follow=True,
),
Rule(
LinkExtractor(
allow="/tennis/",
restrict_xpaths=("//td[@class='name table-participant']//a"),
),
callback="parse_match",
process_request="use_splash",
),
)
def process_links(self, links):
for link in links:
link.url = "http://localhost:8050/render.html?" + urlencode({'url' : link.url})
return links
def _requests_to_follow(self, response):
if not isinstance(response, (HtmlResponse, SplashTextResponse)):
return
seen = set()
for rule_index, rule in enumerate(self._rules):
links = [lnk for lnk in rule.link_extractor.extract_links(response)
if lnk not in seen]
for link in rule.process_links(links):
seen.add(link)
request = self._build_request(rule_index, link)
yield rule.process_request(request, response)
def use_splash(self, request, response):
request.meta.update(splash={'endpoint': 'render.html'})
return request
def parse_tournament(self, response: Response):
logger.info(f"Parsing tournament - {response.url}")
def parse_match(self, response: Response):
logger.info(f"Parsing match - {response.url}")
However, this takes about the same amount of time. I was hoping to see a big increase in speed :(
I've tried playing around with different DOWNLOAD_DELAY settings, but that hasn't made things any faster.
All the concurrency settings are left at their defaults.
Any ideas on if/how I'm going wrong?
ANSWER
Answered 2022-Jan-22 at 16:39
Taking a stab at an answer here with no experience of the libraries.
It looks like Scrapy crawlers themselves are single-threaded. To get multi-threaded behavior you need to configure your application differently or write code that makes it behave multi-threaded. It sounds like you've already tried this, so this is probably not news to you, but make sure you have configured the CONCURRENT_REQUESTS and REACTOR_THREADPOOL_MAXSIZE settings.
https://docs.scrapy.org/en/latest/topics/settings.html?highlight=thread#reactor-threadpool-maxsize
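For example, a minimal sketch of raising those settings via the spider's custom_settings (the values below are illustrative, not tuned for any particular site):
# Illustrative values only -- merge these into the SplashScraper's custom_settings
# (or settings.py) alongside the Splash-related options shown above.
custom_settings = {
    "CONCURRENT_REQUESTS": 32,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 16,
    "REACTOR_THREADPOOL_MAXSIZE": 20,
}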
I can't imagine there is much CPU work going on in the crawling process, so I doubt it's a GIL issue.
Excluding the GIL as an option, there are two possibilities here: either you are not actually running concurrent requests, or the target site is rate limiting you.
To test this, create a global object and store a counter on it. Each time your crawler starts a request increment the counter. Each time your crawler finishes a request, decrement the counter. Then run a thread that prints the counter every second. If your counter value is always 1, then you are still running synchronously.
# global_state.py
GLOBAL_STATE = {"counter": 0}
# middleware.py
from global_state import GLOBAL_STATE
class SeleniumMiddleware:
def process_request(self, request, spider):
GLOBAL_STATE["counter"] += 1
self.driver.get(request.url)
GLOBAL_STATE["counter"] -= 1
...
# main.py
from global_state import GLOBAL_STATE
import threading
import time
def main():
gst = threading.Thread(target=gs_watcher)
gst.start()
# Start your app here
def gs_watcher():
while True:
print(f"Concurrent requests: {GLOBAL_STATE['counter']}")
time.sleep(1)
To test the second possibility, run the application multiple times. If you go from 50 req/s to 25 req/s per application then you are being rate limited. To skirt around this, use a VPN to hop around.
If after that you find that you are running concurrent requests and you are not being rate limited, then something funky is going on in the libraries. Try removing chunks of code until you get to the bare minimum of what you need to crawl. Once you have reached the absolute bare minimum implementation and it's still slow, you have a minimal reproducible example and can get much better, more informed help.
QUESTION
How can I send Dynamic website content to scrapy with the html content generated by selenium browser?
Asked 2022-Jan-20 at 15:35
I am working on certain stock-related projects where I have the task of scraping all data on a daily basis for the last 5 years, i.e. from 2016 to date. I particularly thought of using Selenium because I can drive the crawler and bot to scrape the data based on the date. So I used button clicks with Selenium, and now I want the same data that is displayed by the Selenium browser to be fed to Scrapy. This is the website I am working on right now. I have written the following code inside a Scrapy spider.
class FloorSheetSpider(scrapy.Spider):
name = "nepse"
def start_requests(self):
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
floorsheet_dates = ['01/03/2016','01/04/2016']  # ..., up to till date '01/10/2022'
for date in floorsheet_dates:
driver.get(
"https://merolagani.com/Floorsheet.aspx")
driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
).send_keys(date)
driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
total_length = driver.find_element(By.XPATH,
"//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
z = int((total_length.split()[-1]).replace(']', ''))
for data in range(z, z + 1):
driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
self.url = driver.page_source
yield Request(url=self.url, callback=self.parse)
def parse(self, response, **kwargs):
for value in response.xpath('//tbody/tr'):
print(value.css('td::text').extract()[1])
print("ok"*200)
Update: the error after applying the answer is:
2022-01-14 14:11:36 [twisted] CRITICAL:
Traceback (most recent call last):
File "/home/navaraj/PycharmProjects/first_scrapy/env/lib/python3.8/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
result = current_context.run(gen.send, result)
File "/home/navaraj/PycharmProjects/first_scrapy/env/lib/python3.8/site-packages/scrapy/crawler.py", line 88, in crawl
start_requests = iter(self.spider.start_requests())
TypeError: 'NoneType' object is not iterable
I want to send the current web HTML content to the Scrapy feeder, but I have been getting this unusual error for the past 2 days. Any help or suggestions will be very much appreciated.
ANSWER
Answered 2022-Jan-14 at 09:30
The 2 solutions are not very different. Solution #2 fits better to your question, but choose whatever you prefer.
Solution 1 - create a response with the html's body from the driver and scraping it right away (you can also pass it as an argument to a function):
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from scrapy.http import HtmlResponse
class FloorSheetSpider(scrapy.Spider):
name = "nepse"
def start_requests(self):
# driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
driver = webdriver.Chrome()
floorsheet_dates = ['01/03/2016','01/04/2016']#, up to till date '01/10/2022']
for date in floorsheet_dates:
driver.get(
"https://merolagani.com/Floorsheet.aspx")
driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
).send_keys(date)
driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
total_length = driver.find_element(By.XPATH,
"//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
z = int((total_length.split()[-1]).replace(']', ''))
for data in range(1, z + 1):
driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
self.body = driver.page_source
response = HtmlResponse(url=driver.current_url, body=self.body, encoding='utf-8')
for value in response.xpath('//tbody/tr'):
print(value.css('td::text').extract()[1])
print("ok"*200)
# return an empty requests list
return []
Solution 2 - with super simple downloader middleware:
(You might have a delay here in the parse method, so be patient).
import scrapy
from scrapy import Request
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
class SeleniumMiddleware(object):
def process_request(self, request, spider):
url = spider.driver.current_url
body = spider.driver.page_source
return HtmlResponse(url=url, body=body, encoding='utf-8', request=request)
class FloorSheetSpider(scrapy.Spider):
name = "nepse"
custom_settings = {
'DOWNLOADER_MIDDLEWARES': {
'tempbuffer.spiders.yetanotherspider.SeleniumMiddleware': 543,
# 'projects_name.path.to.your.pipeline': 543
}
}
driver = webdriver.Chrome()
def start_requests(self):
# driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
floorsheet_dates = ['01/03/2016','01/04/2016']#, up to till date '01/10/2022']
for date in floorsheet_dates:
self.driver.get(
"https://merolagani.com/Floorsheet.aspx")
self.driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
).send_keys(date)
self.driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
total_length = self.driver.find_element(By.XPATH,
"//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
z = int((total_length.split()[-1]).replace(']', ''))
for data in range(1, z + 1):
self.driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
self.body = self.driver.page_source
self.url = self.driver.current_url
yield Request(url=self.url, callback=self.parse, dont_filter=True)
def parse(self, response, **kwargs):
print('test ok')
for value in response.xpath('//tbody/tr'):
print(value.css('td::text').extract()[1])
print("ok"*200)
Notice that I've used Chrome, so change it back to Firefox as in your original code.
QUESTION
How to set class variable through __init__ in Python?
Asked 2021-Nov-08 at 20:06
I am trying to change a setting from the command line while starting a scrapy crawler (Python 3.7). Therefore I am adding an __init__ method, but I could not figure out how to change the class variable "delay" from within the __init__ method.
Example minimal:
class testSpider(CrawlSpider):
custom_settings = {
'DOWNLOAD_DELAY': 10, # default value
}
""" get arguments passed over CLI
scrapyd usage: -d arg1=val1
scrapy usage: -a arg1=val1
"""
def __init__(self, *args, **kwargs):
super(testSpider, self).__init__(*args, **kwargs)
self.delay = kwargs.get('delay')
if self.delay:
testSpider.custom_settings['DOWNLOAD_DELAY'] = self.delay
print('init:', testSpider.custom_settings['DOWNLOAD_DELAY'])
print(custom_settings['DOWNLOAD_DELAY'])
Unfortunately, this will not change the setting:
scrapy crawl test -a delay=5
10
init: 5
How can the class variable be changed?
ANSWER
Answered 2021-Nov-08 at 20:06
I am trying to change a setting from the command line while starting a scrapy crawler (Python 3.7). Therefore I am adding an __init__ method...
...
scrapy crawl test -a delay=5
According to the scrapy docs (Settings / Command line options section), it is required to use the -s parameter to update a setting:
scrapy crawl test -s DOWNLOAD_DELAY=5
It is not possible to update settings during runtime in spider code from __init__ or other methods (details in the related GitHub discussion: Update spider settings during runtime #4196).
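That said, DOWNLOAD_DELAY specifically has a documented per-spider counterpart: Scrapy's downloader consults a download_delay attribute on the spider instance if one exists. So one possible workaround, sketched below under that assumption (it only covers the delay, not arbitrary settings), is to set the attribute in __init__ instead of touching custom_settings:
from scrapy.spiders import CrawlSpider

class testSpider(CrawlSpider):
    name = 'test'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # download_delay on the instance is read by Scrapy's downloader,
        # so this takes effect without modifying custom_settings.
        delay = kwargs.get('delay')
        if delay is not None:
            self.download_delay = float(delay)
Invoked as before with scrapy crawl test -a delay=5.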
QUESTION
headless chrome on docker M1 error - unable to discover open window in chrome
Asked 2021-Nov-04 at 08:22
I'm currently trying to run headless Chrome with Selenium on an M1 Mac host / amd64 Ubuntu container.
Because arm Ubuntu does not support the google-chrome-stable package, I decided to use an amd64 Ubuntu base image.
But it does not work; I'm getting this error:
worker_1 | [2021-10-31 03:58:23,286: DEBUG/ForkPoolWorker-10] POST http://localhost:43035/session {"capabilities": {"firstMatch": [{}], "alwaysMatch": {"browserName": "chrome", "pageLoadStrategy": "normal", "goog:chromeOptions": {"extensions": [], "args": ["--no-sandbox", "--disable-dev-shm-usage", "--disable-gpu", "--remote-debugging-port=9222", "--headless"]}}}, "desiredCapabilities": {"browserName": "chrome", "pageLoadStrategy": "normal", "goog:chromeOptions": {"extensions": [], "args": ["--no-sandbox", "--disable-dev-shm-usage", "--disable-gpu", "--remote-debugging-port=9222", "--headless"]}}}
worker_1 | [2021-10-31 03:58:23,330: DEBUG/ForkPoolWorker-10] Starting new HTTP connection (1): localhost:43035
worker_1 | [2021-10-31 03:58:41,311: DEBUG/ForkPoolWorker-12] http://localhost:47089 "POST /session HTTP/1.1" 500 717
worker_1 | [2021-10-31 03:58:41,412: DEBUG/ForkPoolWorker-12] Finished Request
worker_1 | [2021-10-31 03:58:41,825: WARNING/ForkPoolWorker-12] Error occurred while initializing chromedriver - Message: unknown error: unable to discover open window in chrome
worker_1 | (Session info: headless chrome=95.0.4638.69)
worker_1 | Stacktrace:
worker_1 | #0 0x004000a18f93 <unknown>
worker_1 | #1 0x0040004f3908 <unknown>
worker_1 | #2 0x0040004d3cdf <unknown>
worker_1 | #3 0x00400054cabe <unknown>
worker_1 | #4 0x004000546973 <unknown>
worker_1 | #5 0x00400051cdf4 <unknown>
worker_1 | #6 0x00400051dde5 <unknown>
worker_1 | #7 0x004000a482be <unknown>
worker_1 | #8 0x004000a5dba0 <unknown>
worker_1 | #9 0x004000a49215 <unknown>
worker_1 | #10 0x004000a5efe8 <unknown>
worker_1 | #11 0x004000a3d9db <unknown>
worker_1 | #12 0x004000a7a218 <unknown>
worker_1 | #13 0x004000a7a398 <unknown>
worker_1 | #14 0x004000a956cd <unknown>
worker_1 | #15 0x004002b29609 <unknown>
worker_1 |
worker_1 | [2021-10-31 03:58:41,826: WARNING/ForkPoolWorker-12]
worker_1 |
worker_1 | [2021-10-31 03:58:41,867: DEBUG/ForkPoolWorker-11] http://localhost:58147 "POST /session HTTP/1.1" 500 717
worker_1 | [2021-10-31 03:58:41,907: DEBUG/ForkPoolWorker-11] Finished Request
worker_1 | [2021-10-31 03:58:41,946: DEBUG/ForkPoolWorker-12] Using selector: EpollSelector
worker_1 | [WDM] -
worker_1 |
worker_1 | [2021-10-31 03:58:41,962: INFO/ForkPoolWorker-12]
worker_1 |
worker_1 | [WDM] - ====== WebDriver manager ======
worker_1 | [2021-10-31 03:58:41,971: INFO/ForkPoolWorker-12] ====== WebDriver manager ======
worker_1 | [2021-10-31 03:58:42,112: WARNING/ForkPoolWorker-11] Error occurred while initializing chromedriver - Message: unknown error: unable to discover open window in chrome
worker_1 | (Session info: headless chrome=95.0.4638.69)
worker_1 | Stacktrace:
worker_1 | #0 0x004000a18f93 <unknown>
worker_1 | #1 0x0040004f3908 <unknown>
worker_1 | #2 0x0040004d3cdf <unknown>
worker_1 | #3 0x00400054cabe <unknown>
worker_1 | #4 0x004000546973 <unknown>
worker_1 | #5 0x00400051cdf4 <unknown>
worker_1 | #6 0x00400051dde5 <unknown>
worker_1 | #7 0x004000a482be <unknown>
worker_1 | #8 0x004000a5dba0 <unknown>
worker_1 | #9 0x004000a49215 <unknown>
worker_1 | #10 0x004000a5efe8 <unknown>
worker_1 | #11 0x004000a3d9db <unknown>
worker_1 | #12 0x004000a7a218 <unknown>
worker_1 | #13 0x004000a7a398 <unknown>
worker_1 | #14 0x004000a956cd <unknown>
worker_1 | #15 0x004002b29609 <unknown>
worker_1 |
worker_1 | [2021-10-31 03:58:42,113: WARNING/ForkPoolWorker-11]
worker_1 |
worker_1 | [2021-10-31 03:58:42,166: DEBUG/ForkPoolWorker-11] Using selector: EpollSelector
worker_1 | [WDM] -
worker_1 |
worker_1 | [2021-10-31 03:58:42,169: INFO/ForkPoolWorker-11]
worker_1 |
worker_1 | [WDM] - ====== WebDriver manager ======
worker_1 | [2021-10-31 03:58:42,170: INFO/ForkPoolWorker-11] ====== WebDriver manager ======
worker_1 | [2021-10-31 03:58:42,702: DEBUG/ForkPoolWorker-9] http://localhost:51793 "POST /session HTTP/1.1" 500 866
worker_1 | [2021-10-31 03:58:42,719: DEBUG/ForkPoolWorker-9] Finished Request
worker_1 | [2021-10-31 03:58:42,986: WARNING/ForkPoolWorker-9] Error occurred while initializing chromedriver - Message: unknown error: Chrome failed to start: crashed.
worker_1 | (chrome not reachable)
worker_1 | (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
worker_1 | Stacktrace:
worker_1 | #0 0x004000a18f93 <unknown>
worker_1 | #1 0x0040004f3908 <unknown>
worker_1 | #2 0x004000516b32 <unknown>
worker_1 | #3 0x00400051265d <unknown>
worker_1 | #4 0x00400054c770 <unknown>
worker_1 | #5 0x004000546973 <unknown>
worker_1 | #6 0x00400051cdf4 <unknown>
worker_1 | #7 0x00400051dde5 <unknown>
worker_1 | #8 0x004000a482be <unknown>
worker_1 | #9 0x004000a5dba0 <unknown>
worker_1 | #10 0x004000a49215 <unknown>
worker_1 | #11 0x004000a5efe8 <unknown>
worker_1 | #12 0x004000a3d9db <unknown>
worker_1 | #13 0x004000a7a218 <unknown>
worker_1 | #14 0x004000a7a398 <unknown>
worker_1 | #15 0x004000a956cd <unknown>
worker_1 | #16 0x004002b29609 <unknown>
worker_1 |
worker_1 | [2021-10-31 03:58:42,987: WARNING/ForkPoolWorker-9]
worker_1 |
worker_1 | [2021-10-31 03:58:43,045: DEBUG/ForkPoolWorker-9] Using selector: EpollSelector
worker_1 | [WDM] -
worker_1 |
worker_1 | [2021-10-31 03:58:43,049: INFO/ForkPoolWorker-9]
worker_1 |
worker_1 | [WDM] - ====== WebDriver manager ======
worker_1 | [2021-10-31 03:58:43,050: INFO/ForkPoolWorker-9] ====== WebDriver manager ======
worker_1 | [2021-10-31 03:58:43,936: DEBUG/ForkPoolWorker-10] http://localhost:43035 "POST /session HTTP/1.1" 500 866
worker_1 | [2021-10-31 03:58:43,952: DEBUG/ForkPoolWorker-10] Finished Request
worker_1 | [2021-10-31 03:58:44,163: WARNING/ForkPoolWorker-10] Error occurred while initializing chromedriver - Message: unknown error: Chrome failed to start: crashed.
worker_1 | (chrome not reachable)
worker_1 | (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
worker_1 | Stacktrace:
worker_1 | #0 0x004000a18f93 <unknown>
worker_1 | #1 0x0040004f3908 <unknown>
worker_1 | #2 0x004000516b32 <unknown>
worker_1 | #3 0x00400051265d <unknown>
worker_1 | #4 0x00400054c770 <unknown>
worker_1 | #5 0x004000546973 <unknown>
worker_1 | #6 0x00400051cdf4 <unknown>
worker_1 | #7 0x00400051dde5 <unknown>
worker_1 | #8 0x004000a482be <unknown>
worker_1 | #9 0x004000a5dba0 <unknown>
worker_1 | #10 0x004000a49215 <unknown>
worker_1 | #11 0x004000a5efe8 <unknown>
worker_1 | #12 0x004000a3d9db <unknown>
worker_1 | #13 0x004000a7a218 <unknown>
worker_1 | #14 0x004000a7a398 <unknown>
worker_1 | #15 0x004000a956cd <unknown>
worker_1 | #16 0x004002b29609 <unknown>
worker_1 |
worker_1 | [2021-10-31 03:58:44,164: WARNING/ForkPoolWorker-10]
worker_1 |
worker_1 | [2021-10-31 03:58:44,205: DEBUG/ForkPoolWorker-10] Using selector: EpollSelector
worker_1 | [WDM] -
worker_1 |
worker_1 | [2021-10-31 03:58:44,215: INFO/ForkPoolWorker-10]
worker_1 |
worker_1 | [WDM] - ====== WebDriver manager ======
worker_1 | [2021-10-31 03:58:44,217: INFO/ForkPoolWorker-10] ====== WebDriver manager ======
worker_1 | [WDM] - Current google-chrome version is 95.0.4638
worker_1 | [2021-10-31 03:58:44,520: INFO/ForkPoolWorker-12] Current google-chrome version is 95.0.4638
worker_1 | [WDM] - Get LATEST driver version for 95.0.4638
worker_1 | [2021-10-31 03:58:44,525: INFO/ForkPoolWorker-12] Get LATEST driver version for 95.0.4638
worker_1 | [WDM] - Current google-chrome version is 95.0.4638
worker_1 | [2021-10-31 03:58:44,590: INFO/ForkPoolWorker-11] Current google-chrome version is 95.0.4638
worker_1 | [WDM] - Get LATEST driver version for 95.0.4638
worker_1 | [2021-10-31 03:58:44,593: INFO/ForkPoolWorker-11] Get LATEST driver version for 95.0.4638
worker_1 | [2021-10-31 03:58:44,599: DEBUG/ForkPoolWorker-12] Starting new HTTPS connection (1): chromedriver.storage.googleapis.com:443
worker_1 | [2021-10-31 03:58:44,826: DEBUG/ForkPoolWorker-11] Starting new HTTPS connection (1): chromedriver.storage.googleapis.com:443
worker_1 | [2021-10-31 03:58:45,205: DEBUG/ForkPoolWorker-11] https://chromedriver.storage.googleapis.com:443 "GET /LATEST_RELEASE_95.0.4638 HTTP/1.1" 200 12
worker_1 | [2021-10-31 03:58:45,213: DEBUG/ForkPoolWorker-12] https://chromedriver.storage.googleapis.com:443 "GET /LATEST_RELEASE_95.0.4638 HTTP/1.1" 200 12
worker_1 | [WDM] - Driver [/home/ubuntu/.wdm/drivers/chromedriver/linux64/95.0.4638.54/chromedriver] found in cache
worker_1 | [2021-10-31 03:58:45,219: INFO/ForkPoolWorker-11] Driver [/home/ubuntu/.wdm/drivers/chromedriver/linux64/95.0.4638.54/chromedriver] found in cache
worker_1 | [WDM] - Driver [/home/ubuntu/.wdm/drivers/chromedriver/linux64/95.0.4638.54/chromedriver] found in cache
worker_1 | [2021-10-31 03:58:45,242: INFO/ForkPoolWorker-12] Driver [/home/ubuntu/.wdm/drivers/chromedriver/linux64/95.0.4638.54/chromedriver] found in cache
worker_1 | [WDM] - Current google-chrome version is 95.0.4638
worker_1 | [2021-10-31 03:58:45,603: INFO/ForkPoolWorker-9] Current google-chrome version is 95.0.4638
worker_1 | [WDM] - Get LATEST driver version for 95.0.4638
worker_1 | [2021-10-31 03:58:45,610: INFO/ForkPoolWorker-9] Get LATEST driver version for 95.0.4638
similar logs are looped.
When I try to launch Chrome in the Docker container, this error occurs:
ubuntu@742a62c61201:/backend$ google-chrome --no-sandbox --disable-dev-shm-usage --disable-gpu --remote-debugging-port=9222 --headless
qemu: uncaught target signal 5 (Trace/breakpoint trap) - core dumped
qemu: uncaught target signal 5 (Trace/breakpoint trap) - core dumped
[1031/041139.297323:ERROR:bus.cc(392)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[1031/041139.310612:ERROR:file_path_watcher_linux.cc(326)] inotify_init() failed: Function not implemented (38)
DevTools listening on ws://127.0.0.1:9222/devtools/browser/32b15b93-3fe0-4cb8-9c96-8aea011686a8
qemu: unknown option 'type=utility'
[1031/041139.463057:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
[1031/041139.463227:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 1 time(s)
[1031/041139.543335:ERROR:network_service_instance_impl.cc(638)] Network service crashed, restarting service.
qemu: unknown option 'type=utility'
[1031/041139.718793:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
[1031/041139.718877:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 2 time(s)
[1031/041139.736641:ERROR:network_service_instance_impl.cc(638)] Network service crashed, restarting service.
qemu: unknown option 'type=utility'
[1031/041139.788529:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
[1031/041139.788615:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 3 time(s)
[1031/041139.798487:ERROR:network_service_instance_impl.cc(638)] Network service crashed, restarting service.
[1031/041139.808256:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
[1031/041139.808372:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 4 time(s)
qemu: unknown option 'type=utility'
[1031/041139.825267:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
[1031/041139.825354:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 5 time(s)
[1031/041139.830175:ERROR:network_service_instance_impl.cc(638)] Network service crashed, restarting service.
[1031/041139.839159:ERROR:gpu_process_host.cc(973)] GPU process launch failed: error_code=1002
[1031/041139.839345:WARNING:gpu_process_host.cc(1292)] The GPU process has crashed 6 time(s)
[1031/041139.839816:FATAL:gpu_data_manager_impl_private.cc(417)] GPU process isn't usable. Goodbye.
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
Segmentation fault
ubuntu@742a62c61201:/backend$ qemu: unknown option 'type=utility'
ubuntu@742a62c61201:/backend$
Maybe this issue is related? https://github.com/docker/for-mac/issues/5766
If so, is there no way to dockerize headless Chrome using M1?
celery worker Dockerfile
FROM --platform=linux/amd64 ubuntu:20.04
ENV DEBIAN_FRONTEND noninteractive
RUN apt update -y && apt install python3.9 python3-pip python-is-python3 sudo wget -y
RUN pip install --upgrade pip
# set environment variables
ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1
RUN adduser --disabled-password --gecos '' ubuntu
RUN adduser ubuntu sudo
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
USER ubuntu
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
RUN echo "deb http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google.list
RUN sudo apt update -y && sudo apt install -y google-chrome-stable
ENV PATH="/home/ubuntu/.local/bin:$PATH"
WORKDIR /backend
COPY requirements.txt ./
RUN pip install -r requirements.txt --no-cache-dir
COPY . .
ENV DISPLAY=:99
ENTRYPOINT [ "./run-celery.sh" ]
docker-compose.yml
version: "3.3"
services:
frontend:
build:
context: ./frontend
ports:
- "3000:3000"
volumes:
- ./frontend:/frontend
depends_on:
- backend
deploy:
resources:
limits:
cpus: "2"
memory: 4G
reservations:
cpus: "0.5"
memory: 512M
tty: true
stdin_open: true
backend:
build: ./backend
ports:
- "8000:8000"
volumes:
- ./backend:/backend
networks:
- redis-network
depends_on:
- redis
- worker
environment:
- is_docker=1
deploy:
resources:
limits:
cpus: "2"
memory: 4G
reservations:
cpus: "0.5"
memory: 512M
tty: true
worker:
build:
context: ./backend
dockerfile: ./celery-dockerfile/Dockerfile
deploy:
resources:
limits:
cpus: "2"
memory: 4G
reservations:
cpus: "0.5"
memory: 4G
volumes:
- ./backend:/backend
networks:
- redis-network
depends_on:
- redis
environment:
- is_docker=1
privileged: true
tty: true
platform: linux/amd64
redis:
image: redis:alpine
command: redis-server --port 6379
container_name: redis_server
hostname: redis_server
labels:
- "name=redis"
- "mode=standalone"
networks:
- redis-network
expose:
- "6379"
tty: true
networks:
redis-network:
The full crawler code is from the AutoCrawler repository; if you want the complete crawler code, it's better to check out that repository.
I've changed the options during trial and error:
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("--remote-debugging-port=9222")
ANSWER
Answered 2021-Nov-01 at 05:10
I think that there's no way to use Chrome/Chromium on M1 Docker.
Chromium supports arm Ubuntu, so I tried using Chromium instead of Chrome.
But chromedriver officially does not support arm64; I used an unofficial binary from an Electron release: https://stackoverflow.com/a/57586200/11853111
Finally, I've decided to use geckodriver and Firefox while using Docker.
It works seamlessly regardless of host/container architecture.
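For completeness, a minimal sketch of the Firefox/geckodriver setup this switches to (assuming firefox and geckodriver are installed in the image; the driver path below is an assumption):
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument("--headless")

# Path is an assumption -- point it at wherever geckodriver is installed in the image.
driver = webdriver.Firefox(
    options=options,
    executable_path="/usr/local/bin/geckodriver",
)
driver.get("https://example.com")
print(driver.title)
driver.quit()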
QUESTION
How do I pass arguments non-interactively into a bash file that uses "read"?
Asked 2021-Oct-27 at 02:58
I have the following shell script:
#! /bin/bash
. shell_functions/commonShellFunctions.sh
. shell_functions/settings.sh
_valid_url=1
echo "Welcome to HATS Accessibility Testing Tool!"
echo "We recommend using Chrome browser for the best experience."
echo "What would you like to scan today?"
options=("sitemap file containing links" "website")
select opt in "${options[@]}"
do
case $opt in
"sitemap file containing links")
scanType="sitemap"
crawler=crawlSitemap
prompt_message="Please enter URL to sitemap: "
break;;
"website")
prompt_website
break;;
"exit")
exit;;
*)
echo "Invalid option $REPLY";;
esac
done
read -p "$prompt_message" page
echo $page
It was meant to prompt the user; however, I wish to use the script in a CI setting where I pass the arguments through the console without prompting.
I'm currently using echo "<ARG1>\n<ARG2>" | bash run.sh, but I'm wondering if there's a better way to do this.
ANSWER
Answered 2021-Oct-27 at 02:58
Use a here-document:
./run.sh <<EOF
arg1
arg2
EOF
QUESTION
Scrapy crawls duplicate data
Asked 2021-Oct-26 at 12:51
Unfortunately, I currently have a problem with Scrapy. I am still new to Scrapy and would like to scrape information on Rolex watches. I started with the site Watch.de, where I first go through the Rolex listing and want to open the individual watches to get the exact information. However, when I start the crawler I see that many watches are crawled several times. I assume that these come from the "Recently viewed" and "Our new arrivals" sections. Is there a way to ignore these duplicates?
That's my code:
class WatchbotSpider(scrapy.Spider):
name = 'watchbot'
start_urls = ['https://www.watch.de/germany/rolex.html']
def parse(self, response, **kwargs):
for link in response.css('div.product-item-link a::attr(href)'):
yield response.follow(link.get(), callback=self.parse_categories)
def parse_categories(self, response):
for product in response.css('div.product-item-link'):
yield {
'id': product.css('span.product-item-id.product-item-ref::text').get(),
'brand': product.css('div.product-item-brand::text').get(),
'model': product.css('div.product-item-model::text').get(),
'price': product.css('span.price::text').get(),
'year': product.css('span.product-item-id.product-item-year::text').get()
}
ANSWER
Answered 2021-Oct-26 at 12:50
This works:
import scrapy
class WatchbotSpider(scrapy.Spider):
name = 'watchbot'
start_urls = ['https://www.watch.de/germany/rolex.html']
def parse(self, response, **kwargs):
for link in response.css('div.product-item-link a::attr(href)'):
yield response.follow(link.get(), callback=self.parse_categories)
def parse_categories(self, response):
Dict = {
'id': response.xpath('//div[@class="product-ref-item product-ref d-flex align-items-center"]/span/text()').get(),
'brand': response.css('div.product-item-brand::text').get(),
'model': response.xpath('//h1[@class="product-name"]/text()').get(),
'price': response.css('span.price::text').get().replace(u'\xa0', u' '),
'year': response.xpath('//div[@class="product-item-date product-item-option"]/span/text()').get(),
}
print(Dict)
yield Dict
scrapy crawl watchbot > log
In the log:
{'id': '278240', 'brand': 'Rolex ', 'model': 'Rolex Datejust - Edelstahl - Armband Edelstahl / Oyster - 31mm - Ungetragen ', 'price': '8.118 €', 'year': '2021'}
{'id': '116201', 'brand': 'Rolex', 'model': 'Rolex Datejust - Stahl / Roségold - Armband Stahl / Roségold / Oyster - 36mm - Wie neu ', 'price': '14.545 €', 'year': '2018'}
{'id': '126622', 'brand': 'Rolex', 'model': 'Rolex Yacht-Master - Stahl / Platin - Armband Edelstahl / Oyster - 40mm - Ungetragen ', 'price': '15.995 €', 'year': '2020'}
{'id': '124300', 'brand': 'Rolex', 'model': 'Rolex Oyster Perpetual - Edelstahl - Armband Edelstahl / Oyster - 41mm - Ungetragen ', 'price': '9.898 €', 'year': '2021'}
{'id': '116500LN', 'brand': 'Rolex', 'model': 'Rolex Daytona - Edelstahl - Armband Edelstahl / Oyster - 40mm - Wie neu ', 'price': '33.999 €', 'year': '2020'}
{'id': '115234', 'brand': 'Rolex', 'model': 'Rolex Oyster Perpetual Date Diamanten - Stahl / Weißgold - Armband Edelstahl / Oyster - 34mm - Ungetragen - Vintage ', 'price': '11.990 €', 'year': '2021'}
{'id': '126200', 'brand': 'Rolex', 'model': 'Rolex Datejust - Edelstahl - Armband Edelstahl / Jubilé - 36mm - Ungetragen ', 'price': '9.595 €', 'year': '2021'}
{'id': '126333 ', 'brand': 'Rolex', 'model': 'Rolex Datejust - Stahl / Gelbgold - Armband Stahl / Gelbgold / Jubilé - 41mm - Wie neu ', 'price': '15.959 €', 'year': '2021'}
{'id': '126334 ', 'brand': 'Rolex', 'model': 'Rolex Datejust Wimbledon - Stahl / Weißgold - Armband Edelstahl / Oyster - 41mm - Ungetragen ', 'price': '13.399 €', 'year': '2021'}
{'id': '278240', 'brand': 'Rolex', 'model': 'Rolex Datejust - Edelstahl - Armband Edelstahl / Oyster - 31mm - Ungetragen ', 'price': '8.118 €', 'year': '2021'}
.
.
.
Formatting with replace(" ", "") will cause some exceptions, so careful formatting is the next step.
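To address the duplicates coming from the "Recently viewed" / "Our new arrivals" blocks, one possible sketch is to track the reference numbers already yielded and skip repeats (the selectors simply mirror the ones above; the page layout itself is an assumption):
import scrapy

class WatchbotSpider(scrapy.Spider):
    name = 'watchbot'
    start_urls = ['https://www.watch.de/germany/rolex.html']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen_ids = set()  # reference numbers already yielded

    def parse(self, response, **kwargs):
        for link in response.css('div.product-item-link a::attr(href)'):
            yield response.follow(link.get(), callback=self.parse_categories)

    def parse_categories(self, response):
        ref = response.xpath('//div[@class="product-ref-item product-ref d-flex align-items-center"]/span/text()').get()
        # Skip watches that were already reached through another listing block.
        if ref in self.seen_ids:
            return
        self.seen_ids.add(ref)
        yield {
            'id': ref,
            'brand': response.css('div.product-item-brand::text').get(),
            'model': response.xpath('//h1[@class="product-name"]/text()').get(),
            'price': response.css('span.price::text').get(),
            'year': response.xpath('//div[@class="product-item-date product-item-option"]/span/text()').get(),
        }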
QUESTION
AWS Glue Crawler sends all data to Glue Catalog and Athena without Glue Job
Asked 2021-Oct-08 at 14:53
I am new to AWS Glue. I am using an AWS Glue Crawler to crawl data from two S3 buckets, with one file in each bucket. The AWS Glue Crawler creates two tables in the AWS Glue Data Catalog, and I am also able to query the data in AWS Athena.
My understanding was that in order to get the data into Athena I needed to create a Glue job that would pull the data into Athena, but I was wrong. Is it correct to say that the Glue crawler makes the data queryable in Athena without the need for a Glue job, and that if we need to push our data into a database like SQL Server, Oracle, etc., then we need a Glue job?
How can I configure the Glue Crawler so that it fetches only the delta data, and not all the data, from the source bucket every time?
Any help is appreciated.
ANSWER
Answered 2021-Oct-08 at 14:53The Glue crawler is only used to identify the schema that your data is in. Your data sits somewhere (e.g. S3) and the crawler identifies the schema by going through a percentage of your files.
You then can use a query engine like Athena (managed, serverless Apache Presto) to query the data, since it already has a schema.
If you want to process / clean / aggregate the data, you can use Glue Jobs, which is basically managed serverless Spark.
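For illustration, a minimal boto3 sketch of querying one of the crawled tables through Athena (the database, table, and results bucket names are placeholders):
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Placeholder names -- use the database/table the crawler created and your own results bucket.
response = athena.start_query_execution(
    QueryString="SELECT * FROM my_crawled_table LIMIT 10",
    QueryExecutionContext={"Database": "my_glue_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
print(response["QueryExecutionId"])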
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.