crawler4j: Open Source Web Crawler for Java
Using Maven
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.4.0</version>
</dependency>
Using Gradle
compile group: 'edu.uci.ics', name: 'crawler4j', version: '4.4.0'
Quickstart
You need to create a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles the downloaded page. The following is a sample implementation:
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
            + "|png|mp3|mp4|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new URL and the second parameter is
     * the new URL. You should implement this method to specify whether
     * the given URL should be crawled or not (based on your crawling logic).
     * In this example, we instruct the crawler to ignore URLs that have
     * css, js, gif, ... extensions and to only accept URLs that start
     * with "https://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("https://www.ics.uci.edu/");
    }

    /**
     * This method is called when a page has been fetched and is ready
     * to be processed by your program.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}
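You should also implement a controller class that specifies the seeds of the crawl, the folder for intermediate crawl data, and the number of concurrent threads. The following is a minimal sketch against the crawler4j 4.x API; the storage folder, seed URL, and thread count are example values.
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        // Folder where intermediate crawl data is stored (example path)
        config.setCrawlStorageFolder("/data/crawl/root");

        // Wire up page fetching and robots.txt handling for the controller
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Crawling starts from the seed URLs
        controller.addSeed("https://www.ics.uci.edu/");

        // Blocks until the crawl is finished, using 7 crawler threads
        controller.start(MyCrawler.class, 7);
    }
}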
Crawl depth
crawlConfig.setMaxDepthOfCrawling(maxDepthOfCrawling);
Enable SSL
CrawlConfig config = new CrawlConfig();
config.setIncludeHttpsPages(true);
Maximum number of pages to crawl
crawlConfig.setMaxPagesToFetch(maxPagesToFetch);
Enable Binary Content Crawling
crawlConfig.setIncludeBinaryContentInCrawling(true);
Politeness
crawlConfig.setPolitenessDelay(politenessDelay);
Proxy
crawlConfig.setProxyHost("proxyserver.example.com");
crawlConfig.setProxyPort(8080);
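If the proxy requires authentication, CrawlConfig also provides credential setters; the values below are placeholders.
crawlConfig.setProxyUsername("username");
crawlConfig.setProxyPassword("password");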
Resumable Crawling
crawlConfig.setResumableCrawling(true);
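Resumable crawling persists intermediate crawl state to disk, so it only helps if the crawl storage folder survives restarts. A small sketch; the path is an example.
CrawlConfig crawlConfig = new CrawlConfig();
// Keep this folder on persistent storage so an interrupted crawl can resume
crawlConfig.setCrawlStorageFolder("/data/crawl/root");
crawlConfig.setResumableCrawling(true);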
User agent string
"crawler4j (https://github.com/yasserg/crawler4j/)"
QUESTION
docker wordpress + nginx returning empty response on curl without headers
Asked 2021-Nov-17 at 16:04

I have a WordPress + nginx setup in a Docker container that works perfectly through the browser, but when I try to send an HTTP request via curl without headers, the response is always empty:
❯ curl -vv localhost:8080
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET / HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.64.1
> Accept: */*
>
* Empty reply from server
* Connection #0 to host localhost left intact
curl: (52) Empty reply from server
* Closing connection 0
It does work if I add any User-Agent header with the -H option, but I would like it to work even when there is no User-Agent in the headers.
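For example, a request like the following (the header value here is arbitrary) gets a normal response:
❯ curl -H 'User-Agent: Mozilla/5.0' localhost:8080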
Here are my nginx settings:
worker_processes 1;
daemon off;
events {
worker_connections 1024;
}
http {
root /var/www/html;
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
access_log /dev/stdout main;
error_log /dev/stderr error;
sendfile on;
tcp_nopush on;
tcp_nodelay on;
#keepalive low (5seconds), should force hackers to re-connect.
keepalive_timeout 5;
fastcgi_intercept_errors on;
fastcgi_buffers 16 16k;
fastcgi_buffer_size 32k;
default_type application/octet-stream;
#php max upload limit cannot be larger than this
client_max_body_size 40m;
gzip on;
gzip_disable "msie6";
gzip_min_length 256;
gzip_comp_level 4;
gzip_types text/plain text/css application/json application/x-javascript text/xml application/xml application/xml+rss text/javascript application/javascript image/svg+xml;
limit_req_zone $remote_addr zone=loginauth:10m rate=15r/s;
include /etc/nginx/mime.types;
include /etc/nginx/nginx-server.conf;
}
server {
listen 8080 default_server;
server_name "localhost";
access_log /dev/stdout main;
error_log /dev/stdout error;
# pass the PHP scripts to FastCGI
location ~ \.php$ {
include fastcgi_params;
fastcgi_pass unix:/home/www-data/php-fpm.sock;
fastcgi_index index.php;
fastcgi_param DOCUMENT_ROOT $realpath_root;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
fastcgi_intercept_errors on;
}
#Deny access to .htaccess, .htpasswd...
location ~ /\.ht {
deny all;
}
location ~* .(jpg|jpeg|png|gif|ico|css|js|pdf|doc|docx|odt|rtf|ppt|pptx|xls|xlsx|txt)$ {
expires max;
}
location = /favicon.ico {
log_not_found off;
access_log off;
}
location = /robots.txt {
allow all;
log_not_found off;
access_log off;
}
#Block bad-bots
if ($http_user_agent ~* (360Spider|80legs.com|Abonti|AcoonBot|Acunetix|adbeat_bot|AddThis.com|adidxbot|ADmantX|AhrefsBot|AngloINFO|Antelope|Applebot|BaiduSpider|BeetleBot|billigerbot|binlar|bitlybot|BlackWidow|BLP_bbot|BoardReader|Bolt\ 0|BOT\ for\ JCE|Bot\ mailto\:craftbot@yahoo\.com|casper|CazoodleBot|CCBot|checkprivacy|ChinaClaw|chromeframe|Clerkbot|Cliqzbot|clshttp|CommonCrawler|comodo|CPython|crawler4j|Crawlera|CRAZYWEBCRAWLER|Curious|Curl|Custo|CWS_proxy|Default\ Browser\ 0|diavol|DigExt|Digincore|DIIbot|discobot|DISCo|DoCoMo|DotBot|Download\ Demon|DTS.Agent|EasouSpider|eCatch|ecxi|EirGrabber|Elmer|EmailCollector|EmailSiphon|EmailWolf|Exabot|ExaleadCloudView|ExpertSearchSpider|ExpertSearch|Express\ WebPictures|ExtractorPro|extract|EyeNetIE|Ezooms|F2S|FastSeek|feedfinder|FeedlyBot|FHscan|finbot|Flamingo_SearchEngine|FlappyBot|FlashGet|flicky|Flipboard|g00g1e|Genieo|genieo|GetRight|GetWeb\!|GigablastOpenSource|GozaikBot|Go\!Zilla|Go\-Ahead\-Got\-It|GrabNet|grab|Grafula|GrapeshotCrawler|GTB5|GT\:\:WWW|Guzzle|harvest|heritrix|HMView|HomePageBot|HTTP\:\:Lite|HTTrack|HubSpot|ia_archiver|icarus6|IDBot|id\-search|IlseBot|Image\ Stripper|Image\ Sucker|Indigonet|Indy\ Library|integromedb|InterGET|InternetSeer\.com|Internet\ Ninja|IRLbot|ISC\ Systems\ iRc\ Search\ 2\.1|jakarta|Java|JetCar|JobdiggerSpider|JOC\ Web\ Spider|Jooblebot|kanagawa|KINGSpider|kmccrew|larbin|LeechFTP|libwww|Lingewoud|LinkChecker|linkdexbot|LinksCrawler|LinksManager\.com_bot|linkwalker|LinqiaRSSBot|LivelapBot|ltx71|LubbersBot|lwp\-trivial|Mail.RU_Bot|masscan|Mass\ Downloader|maverick|Maxthon$|Mediatoolkitbot|MegaIndex|MegaIndex|megaindex|MFC_Tear_Sample|Microsoft\ URL\ Control|microsoft\.url|MIDown\ tool|miner|Missigua\ Locator|Mister\ PiX|mj12bot|Mozilla.*Indy|Mozilla.*NEWT|MSFrontPage|msnbot|Navroad|NearSite|NetAnts|netEstate|NetSpider|NetZIP|Net\ Vampire|NextGenSearchBot|nutch|Octopus|Offline\ Explorer|Offline\ Navigator|OpenindexSpider|OpenWebSpider|OrangeBot|Owlin|PageGrabber|PagesInventory|panopta|panscient\.com|Papa\ Foto|pavuk|pcBrowser|PECL\:\:HTTP|PeoplePal|Photon|PHPCrawl|planetwork|PleaseCrawl|PNAMAIN.EXE|PodcastPartyBot|prijsbest|proximic|psbot|purebot|pycurl|QuerySeekerSpider|R6_CommentReader|R6_FeedFetcher|RealDownload|ReGet|Riddler|Rippers\ 0|rogerbot|RSSingBot|rv\:1.9.1|RyzeCrawler|SafeSearch|SBIder|Scrapy|Scrapy|Screaming|SeaMonkey$|search.goo.ne.jp|SearchmetricsBot|search_robot|SemrushBot|Semrush|SentiBot|SEOkicks|SeznamBot|ShowyouBot|SightupBot|SISTRIX|sitecheck\.internetseer\.com|siteexplorer.info|SiteSnagger|skygrid|Slackbot|Slurp|SmartDownload|Snoopy|Sogou|Sosospider|spaumbot|Steeler|sucker|SuperBot|Superfeedr|SuperHTTP|SurdotlyBot|Surfbot|tAkeOut|Teleport\ Pro|TinEye-bot|TinEye|Toata\ dragostea\ mea\ pentru\ diavola|Toplistbot|trendictionbot|TurnitinBot|turnit|Twitterbot|URI\:\:Fetch|urllib|Vagabondo|Vagabondo|vikspider|VoidEYE|VoilaBot|WBSearchBot|webalta|WebAuto|WebBandit|WebCollage|WebCopier|WebFetch|WebGo\ IS|WebLeacher|WebReaper|WebSauger|Website\ eXtractor|Website\ Quester|WebStripper|WebWhacker|WebZIP|Web\ Image\ Collector|Web\ Sucker|Wells\ Search\ II|WEP\ Search|WeSEE|Wget|Widow|WinInet|woobot|woopingbot|worldwebheritage.org|Wotbox|WPScan|WWWOFFLE|WWW\-Mechanize|Xaldon\ WebSpider|XoviBot|yacybot|Yahoo|YandexBot|Yandex|YisouSpider|zermelo|Zeus|zh-CN|ZmEu|ZumBot|ZyBorg) ) {
return 444;
}
include /etc/nginx/nginx-locations.conf;
include /var/www/nginx/locations/*;
}
# Deny all attempts to access hidden files such as .htaccess, .htpasswd, .DS_Store (Mac).
# Keep logging the requests to parse later (or to pass to firewall utilities such as fail2ban)
location ~ /\. {
deny all;
}
# Deny access to any files with a .php extension in the uploads directory for the single site
location ~ ^/wp-content/uploads/.*\.php$ {
deny all;
}
#Deny access to wp-content folders for suspicious files
location ~* ^/(wp-content)/(.*?)\.(zip|gz|tar|bzip2|7z)\$ { deny all; }
location ~ ^/wp-content/uploads/sucuri { deny all; }
location ~ ^/wp-content/updraft { deny all; }
location ~* ^/wp-content/uploads/.*.(html|htm|shtml|php|js|swf)$ {
deny all;
}
# Block PHP files in includes directory.
location ~* /wp-includes/.*\.php\$ {
deny all;
}
# Deny access to any files with a .php extension in the uploads directory
# Works in sub-directory installs and also in multisite network
# Keep logging the requests to parse later (or to pass to firewall utilities such as fail2ban)
location ~* /(?:uploads|files|wp-content|wp-includes)/.*\.php$ {
deny all;
}
# Block nginx-help log from public viewing
location ~* /wp-content/uploads/nginx-helper/ { deny all; }
# Deny access to any files with a .php extension in the uploads directory
# Works in sub-directory installs and also in multisite network
location ~* /(?:uploads|files)/.*\.php\$ { deny all; }
# Deny access to uploads that aren’t images, videos, music, etc.
location ~* ^/wp-content/uploads/.*.(html|htm|shtml|php|js|swf|css)$ {
deny all;
}
location / {
# This is cool because no php is touched for static content.
# include the "?$args" part so non-default permalinks doesn't break when using query string
index index.php index.html;
try_files $uri $uri/ /index.php?$args;
}
# More ideas from:
# https://gist.github.com/ethanpil/1bfd01a817a8198369efec5c4cde6628
location ~* /(\.|wp-config\.php|wp-config\.txt|changelog\.txt|readme\.txt|readme\.html|license\.txt) { deny all; }
# Make sure files with the following extensions do not get loaded by nginx because nginx would display the source code, and these files can contain PASSWORDS!
location ~* \.(engine|inc|info|install|make|module|profile|test|po|sh|.*sql|theme|tpl(\.php)?|xtmpl)\$|^(\..*|Entries.*|Repository|Root|Tag|Template)\$|\.php_
{
return 444;
}
#nocgi
location ~* \.(pl|cgi|py|sh|lua)\$ {
return 444;
}
#disallow
location ~* (w00tw00t) {
return 444;
}
My aim is to get the server to respond to any request, even if it has no User-agent header.
Thanks for your time!
ANSWER
Answered 2021-Nov-17 at 16:04

This has nothing to do with Docker or WordPress or anything else. It is solely your nginx configuration that is rejecting the request: you have Curl in your $http_user_agent comparison in nginx-server.conf:
#Block bad-bots
if ($http_user_agent ~* (...|Curl|...) ) {
return 444;
}
and because ~* is a case-insensitive matching operator, every request from curl returns 444 here.
Here is an example of how you can check it using grep:
$ echo 'curl/7.64.1' | grep -iPo '(...some...|Curl|...other...)'
curl
And code 444 is a special non-standard nginx code which, when returned, forces nginx to close the connection immediately without sending anything to the client. This is comparable to a connection reject (closed by peer).
FWIW (for people searching for why a request is not processed as expected): nginx can log at debug level (e.g. it can be enabled for certain connections in order to debug them), so the error log will contain detailed information about how a request is processed: which locations and rewrite rules are triggered, what happens to the request at every processing stage, and which response and error code are supplied to the client at the end.
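As a sketch of that debugging approach (this assumes an nginx binary built with --with-debug; the client address is an example):
# Raise the error log to debug level...
error_log /dev/stderr debug;

events {
    # ...or restrict debug-level output to connections from one client
    debug_connection 127.0.0.1;
}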
The community discussion and code snippets above contain sources that include the Stack Exchange Network.