
crawler4j | Open Source Simple Web Crawler for Java | Crawler library

by zhuoran | Java | Version: Current | License: Apache-2.0


kandi X-RAY | crawler4j Summary

crawler4j is a Java library typically used in Automation and Crawler applications. crawler4j has no vulnerabilities, has a build file available, has a permissive license, and has high support. However, crawler4j has 4 bugs. You can download it from GitHub.
It is composed of two parts: crawler4j-core and crawler4j-simple (see Key Features below).

Support

  • crawler4j has a highly active ecosystem.
  • It has 30 stars, 16 forks, and 11 watchers.
  • It had no major release in the last 12 months.
  • crawler4j has no issues reported. There are 2 open pull requests and 0 closed pull requests.
  • It has a negative sentiment in the developer community.
  • The latest version of crawler4j is current.

Quality

  • crawler4j has 4 bugs (0 blocker, 0 critical, 3 major, 1 minor) and 95 code smells.

Security

  • crawler4j has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
  • crawler4j code analysis shows 0 unresolved vulnerabilities.
  • There are 16 security hotspots that need review.

License

  • crawler4j is licensed under the Apache-2.0 License. This license is Permissive.
  • Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

  • crawler4j releases are not available. You will need to build from source code and install.
  • Build file is available. You can build the component from source.
  • Installation instructions are not available. Examples and code snippets are available.
  • crawler4j saves you 914 person hours of effort in developing the same functionality from scratch.
  • It has 2086 lines of code, 129 functions and 26 files.
  • It has high code complexity. Code complexity directly impacts maintainability of the code.
Top functions reviewed by kandi - BETA

kandi has reviewed crawler4j and identified the functions below as its top functions. This is intended to give you an instant insight into the functionality crawler4j implements and to help you decide whether it suits your requirements.

  • Start monitoring the crawler.
  • Creates a URL instance from a URL.
  • Loads the configuration from a configuration file.
  • Performs the crawl.
  • Performs a GET request.
  • Loads properties from a list of resources.
  • Translates a string.
  • Gets absolute URLs from a jsoup query (see the sketch after this list).
  • Entry point to the CrawlController.
  • Tries to consume the content.
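
One of the functions above, "Gets absolute URLs from a jsoup query", refers to jsoup's link resolution. As a rough illustration only (this is not crawler4j's own code; the URL and user-agent string are placeholders), extracting absolute URLs with jsoup itself typically looks like this:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class LinkExtractor {
        public static void main(String[] args) throws Exception {
            // Fetch and parse a page; the URL and user-agent are placeholder values.
            Document doc = Jsoup.connect("https://example.com/")
                    .userAgent("my-crawler/1.0")
                    .get();
            // "abs:href" resolves each href against the document's base URI,
            // so relative links come back as absolute URLs.
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.attr("abs:href"));
            }
        }
    }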

Get all kandi verified functions for this library.

crawler4j Key Features

crawler4j-core: the crawler4j core module.

crawler4j-simple: a simple web crawler implementation based on crawler4j-core.

Supports automatic recognition of more data types.

Improves the HttpClient code: supports more crawling modes and parsing of gzip and other formats.

Refactors and optimizes the architecture.

Email: zoran.wang@gmail.com

Twitter: @lopor

Sina Weibo: @王小然

Community Discussions

Trending Discussions on crawler4j
• docker wordpress + nginx returning empty response on curl without headers

QUESTION

docker wordpress + nginx returning empty response on curl without headers

Asked 2021-Nov-17 at 16:04

I have WordPress + nginx in a Docker container that works perfectly through the browser, but when I try to send an HTTP request via curl without headers, the response is always empty:

                      ❯ curl -vv localhost:8080
                      *   Trying 127.0.0.1...
                      * TCP_NODELAY set
                      * Connected to localhost (127.0.0.1) port 8080 (#0)
                      > GET / HTTP/1.1
                      > Host: localhost:8080
                      > User-Agent: curl/7.64.1
                      > Accept: */*
                      >
                      * Empty reply from server
                      * Connection #0 to host localhost left intact
                      curl: (52) Empty reply from server
                      * Closing connection 0
                      

It does work if I add any User-Agent header with the -H option, but I would like it to work even when there is no User-Agent in the headers.

Here are my nginx settings:

• nginx.conf
                      worker_processes 1;
                      daemon off;
                      
                      events {
                          worker_connections 1024;
                      }
                      
                      http {
                          root /var/www/html;
                      
                          log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                                            '$status $body_bytes_sent "$http_referer" '
                                            '"$http_user_agent" "$http_x_forwarded_for"';
                      
                          access_log /dev/stdout main;
                          error_log /dev/stderr error;
                      
                          sendfile on;
                          tcp_nopush on;
                          tcp_nodelay on;
                          #keepalive low (5seconds), should force hackers to re-connect.
                          keepalive_timeout  5;
                          fastcgi_intercept_errors on;
                          fastcgi_buffers 16 16k; 
                          fastcgi_buffer_size 32k;
                          default_type       application/octet-stream;
                      
                          #php max upload limit cannot be larger than this
                          client_max_body_size 40m;
                      
                          gzip on;
                          gzip_disable "msie6";
                          gzip_min_length 256;
                          gzip_comp_level 4;
                          gzip_types text/plain text/css application/json application/x-javascript text/xml application/xml application/xml+rss text/javascript application/javascript image/svg+xml;
                      
                          limit_req_zone $remote_addr zone=loginauth:10m rate=15r/s;
                      
                          include  /etc/nginx/mime.types;
                          include  /etc/nginx/nginx-server.conf;
                      }
                      
• nginx-server.conf
                      server {
                          listen 8080 default_server;
                          server_name "localhost";
                      
                          access_log /dev/stdout main;
                          error_log /dev/stdout error;
                      
                          # pass the PHP scripts to FastCGI
                          location ~ \.php$ {
                              include        fastcgi_params;
                              fastcgi_pass   unix:/home/www-data/php-fpm.sock;
                              fastcgi_index  index.php;
                              fastcgi_param  DOCUMENT_ROOT $realpath_root;
                              fastcgi_param  SCRIPT_FILENAME $document_root$fastcgi_script_name;
                              fastcgi_intercept_errors on;
                          }
                      
                          #Deny access to .htaccess, .htpasswd...
                          location ~ /\.ht {
                              deny  all;
                          }
                      
                      
                          location ~* .(jpg|jpeg|png|gif|ico|css|js|pdf|doc|docx|odt|rtf|ppt|pptx|xls|xlsx|txt)$ {
                            expires max;
                          }
                      
                          location = /favicon.ico {
                              log_not_found off;
                              access_log off;
                          }
                      
                          location = /robots.txt {
                              allow all;
                              log_not_found off;
                              access_log off;
                          }
                      
                          #Block bad-bots
                          if ($http_user_agent ~* (360Spider|80legs.com|Abonti|AcoonBot|Acunetix|adbeat_bot|AddThis.com|adidxbot|ADmantX|AhrefsBot|AngloINFO|Antelope|Applebot|BaiduSpider|BeetleBot|billigerbot|binlar|bitlybot|BlackWidow|BLP_bbot|BoardReader|Bolt\ 0|BOT\ for\ JCE|Bot\ mailto\:craftbot@yahoo\.com|casper|CazoodleBot|CCBot|checkprivacy|ChinaClaw|chromeframe|Clerkbot|Cliqzbot|clshttp|CommonCrawler|comodo|CPython|crawler4j|Crawlera|CRAZYWEBCRAWLER|Curious|Curl|Custo|CWS_proxy|Default\ Browser\ 0|diavol|DigExt|Digincore|DIIbot|discobot|DISCo|DoCoMo|DotBot|Download\ Demon|DTS.Agent|EasouSpider|eCatch|ecxi|EirGrabber|Elmer|EmailCollector|EmailSiphon|EmailWolf|Exabot|ExaleadCloudView|ExpertSearchSpider|ExpertSearch|Express\ WebPictures|ExtractorPro|extract|EyeNetIE|Ezooms|F2S|FastSeek|feedfinder|FeedlyBot|FHscan|finbot|Flamingo_SearchEngine|FlappyBot|FlashGet|flicky|Flipboard|g00g1e|Genieo|genieo|GetRight|GetWeb\!|GigablastOpenSource|GozaikBot|Go\!Zilla|Go\-Ahead\-Got\-It|GrabNet|grab|Grafula|GrapeshotCrawler|GTB5|GT\:\:WWW|Guzzle|harvest|heritrix|HMView|HomePageBot|HTTP\:\:Lite|HTTrack|HubSpot|ia_archiver|icarus6|IDBot|id\-search|IlseBot|Image\ Stripper|Image\ Sucker|Indigonet|Indy\ Library|integromedb|InterGET|InternetSeer\.com|Internet\ Ninja|IRLbot|ISC\ Systems\ iRc\ Search\ 2\.1|jakarta|Java|JetCar|JobdiggerSpider|JOC\ Web\ Spider|Jooblebot|kanagawa|KINGSpider|kmccrew|larbin|LeechFTP|libwww|Lingewoud|LinkChecker|linkdexbot|LinksCrawler|LinksManager\.com_bot|linkwalker|LinqiaRSSBot|LivelapBot|ltx71|LubbersBot|lwp\-trivial|Mail.RU_Bot|masscan|Mass\ Downloader|maverick|Maxthon$|Mediatoolkitbot|MegaIndex|MegaIndex|megaindex|MFC_Tear_Sample|Microsoft\ URL\ Control|microsoft\.url|MIDown\ tool|miner|Missigua\ Locator|Mister\ PiX|mj12bot|Mozilla.*Indy|Mozilla.*NEWT|MSFrontPage|msnbot|Navroad|NearSite|NetAnts|netEstate|NetSpider|NetZIP|Net\ Vampire|NextGenSearchBot|nutch|Octopus|Offline\ Explorer|Offline\ Navigator|OpenindexSpider|OpenWebSpider|OrangeBot|Owlin|PageGrabber|PagesInventory|panopta|panscient\.com|Papa\ Foto|pavuk|pcBrowser|PECL\:\:HTTP|PeoplePal|Photon|PHPCrawl|planetwork|PleaseCrawl|PNAMAIN.EXE|PodcastPartyBot|prijsbest|proximic|psbot|purebot|pycurl|QuerySeekerSpider|R6_CommentReader|R6_FeedFetcher|RealDownload|ReGet|Riddler|Rippers\ 0|rogerbot|RSSingBot|rv\:1.9.1|RyzeCrawler|SafeSearch|SBIder|Scrapy|Scrapy|Screaming|SeaMonkey$|search.goo.ne.jp|SearchmetricsBot|search_robot|SemrushBot|Semrush|SentiBot|SEOkicks|SeznamBot|ShowyouBot|SightupBot|SISTRIX|sitecheck\.internetseer\.com|siteexplorer.info|SiteSnagger|skygrid|Slackbot|Slurp|SmartDownload|Snoopy|Sogou|Sosospider|spaumbot|Steeler|sucker|SuperBot|Superfeedr|SuperHTTP|SurdotlyBot|Surfbot|tAkeOut|Teleport\ Pro|TinEye-bot|TinEye|Toata\ dragostea\ mea\ pentru\ diavola|Toplistbot|trendictionbot|TurnitinBot|turnit|Twitterbot|URI\:\:Fetch|urllib|Vagabondo|Vagabondo|vikspider|VoidEYE|VoilaBot|WBSearchBot|webalta|WebAuto|WebBandit|WebCollage|WebCopier|WebFetch|WebGo\ IS|WebLeacher|WebReaper|WebSauger|Website\ eXtractor|Website\ Quester|WebStripper|WebWhacker|WebZIP|Web\ Image\ Collector|Web\ Sucker|Wells\ Search\ II|WEP\ Search|WeSEE|Wget|Widow|WinInet|woobot|woopingbot|worldwebheritage.org|Wotbox|WPScan|WWWOFFLE|WWW\-Mechanize|Xaldon\ WebSpider|XoviBot|yacybot|Yahoo|YandexBot|Yandex|YisouSpider|zermelo|Zeus|zh-CN|ZmEu|ZumBot|ZyBorg) ) {
                                    return 444;
                          }
                      
                          include /etc/nginx/nginx-locations.conf;
                          include /var/www/nginx/locations/*;
                      }
                      
                      
• nginx-locations.conf
                      # Deny all attempts to access hidden files such as .htaccess, .htpasswd, .DS_Store (Mac).
                      # Keep logging the requests to parse later (or to pass to firewall utilities such as fail2ban)
                      location ~ /\. {
                          deny all;
                      }
                      
                      # Deny access to any files with a .php extension in the uploads directory for the single site
                      location ~ ^/wp-content/uploads/.*\.php$ {
                          deny all;
                      }
                      
                      #Deny access to wp-content folders for suspicious files
                      location ~* ^/(wp-content)/(.*?)\.(zip|gz|tar|bzip2|7z)\$ { deny all; }
                      location ~ ^/wp-content/uploads/sucuri { deny all; }
                      location ~ ^/wp-content/updraft { deny all; }
                      location ~* ^/wp-content/uploads/.*.(html|htm|shtml|php|js|swf)$ {
                              deny all;
                      }
                      
                      # Block PHP files in includes directory.
                      location ~* /wp-includes/.*\.php\$ {
                        deny all;
                      }
                      
                      # Deny access to any files with a .php extension in the uploads directory
                      # Works in sub-directory installs and also in multisite network
                      # Keep logging the requests to parse later (or to pass to firewall utilities such as fail2ban)
                      location ~* /(?:uploads|files|wp-content|wp-includes)/.*\.php$ {
                          deny all;
                      }
                      
                      # Block nginx-help log from public viewing
                      location ~* /wp-content/uploads/nginx-helper/ { deny all; }
                      
                      # Deny access to any files with a .php extension in the uploads directory
                      # Works in sub-directory installs and also in multisite network
                      location ~* /(?:uploads|files)/.*\.php\$ { deny all; }
                      
                      # Deny access to uploads that aren’t images, videos, music, etc.
                      location ~* ^/wp-content/uploads/.*.(html|htm|shtml|php|js|swf|css)$ {
                          deny all;
                      }
                      
                      
                      
                      location / {
                              # This is cool because no php is touched for static content.
                              # include the "?$args" part so non-default permalinks doesn't break when using query string
                                      index  index.php index.html;
                              try_files $uri $uri/ /index.php?$args;
                      }
                      
                      # More ideas from:
                      # https://gist.github.com/ethanpil/1bfd01a817a8198369efec5c4cde6628
                      
                      location ~* /(\.|wp-config\.php|wp-config\.txt|changelog\.txt|readme\.txt|readme\.html|license\.txt) { deny all; }
                      
                      # Make sure files with the following extensions do not get loaded by nginx because nginx would display the source code, and these files can contain PASSWORDS!
                      location ~* \.(engine|inc|info|install|make|module|profile|test|po|sh|.*sql|theme|tpl(\.php)?|xtmpl)\$|^(\..*|Entries.*|Repository|Root|Tag|Template)\$|\.php_
                      {
                          return 444;
                      }
                      #nocgi
                      location ~* \.(pl|cgi|py|sh|lua)\$ {
                          return 444;
                      }
                      #disallow
                      location ~* (w00tw00t) {
                          return 444;
                      }
                      
                      

My aim is to get the server to respond to any request, even if it has no User-Agent header.

Thanks for your time!

ANSWER

Answered 2021-Nov-17 at 16:04

This has nothing to do with Docker or WordPress or anything else. It is your nginx configuration alone that is rejecting the request:

You have Curl in your $http_user_agent check in nginx-server.conf:

                          #Block bad-bots
                          if ($http_user_agent ~* (...|Curl|...) ) {
                                    return 444;
                          }
                      

and because ~* is a case-insensitive matching operator, every request from curl will match here and be answered with 444.

Here is an example of how you can check it using grep:

                      $ echo 'curl/7.64.1' | grep -iPo '(...some...|Curl|...other...)'
                      curl
                      

Code 444 is a special non-standard nginx code: when returned, it forces nginx to close the connection immediately without sending anything to the client. This is comparable to a connection reject (connection closed by peer).
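
If you are probing this from Java rather than curl, the sketch below (using the standard java.net.http client; the host, port, and User-Agent value are taken from the question and are only examples) shows what a 444 looks like on the client side: no status code ever arrives, the connection is simply closed, and the request fails with an IOException.

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class Probe444 {
        public static void main(String[] args) throws InterruptedException {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/"))
                    // This value matches the bad-bot regex case-insensitively,
                    // so nginx answers with 444 and closes the connection.
                    .header("User-Agent", "curl/7.64.1")
                    .GET()
                    .build();
            try {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println("Status: " + response.statusCode());
            } catch (IOException e) {
                // 444 never produces an HTTP response; the socket is closed instead,
                // the Java equivalent of curl's "(52) Empty reply from server".
                System.out.println("No HTTP response received: " + e);
            }
        }
    }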

FWIW (for people searching why a request is not processed as expected): nginx can enable debug logging (for example, for a certain port listener, in order to debug it), so the error log will contain detailed information about how the request is processed: which locations and rewrite rules are triggered, what happens to the request at every stage, and which response and error code are finally sent to the client.

Source: https://stackoverflow.com/questions/69915359

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

Vulnerabilities

No vulnerabilities reported

Install crawler4j

You can download it from GitHub.
You can use crawler4j like any standard Java library. Include the jar files in your classpath. You can also use any IDE and run and debug the crawler4j component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.
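
As a rough orientation for a first crawl, here is a minimal sketch using the API of the widely known crawler4j project (classes such as CrawlConfig, CrawlController, and WebCrawler). The package names, class names, and method signatures below are assumptions and may differ in this particular fork, so verify them against the sources on GitHub.

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class BasicCrawlExample {

        // A crawler that stays on one site and prints every page it visits.
        public static class MyCrawler extends WebCrawler {
            @Override
            public boolean shouldVisit(Page referringPage, WebURL url) {
                return url.getURL().startsWith("https://example.com/");
            }

            @Override
            public void visit(Page page) {
                System.out.println("Visited: " + page.getWebURL().getURL());
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawler4j"); // intermediate crawl data
            config.setPolitenessDelay(1000);                // wait 1s between requests to a host

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtServer robotstxtServer =
                    new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
            CrawlController controller =
                    new CrawlController(config, pageFetcher, robotstxtServer);

            controller.addSeed("https://example.com/");
            controller.start(MyCrawler.class, 2);           // run 2 crawler threads
        }
    }

Running the class starts the crawl from the seed URL and prints each visited page to standard output.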

Support

For new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask questions on the community page or Stack Overflow.
