browsertrix-crawler | high-fidelity browser-based crawler

by webrecorder | JavaScript | Version: 0.10.1 | License: AGPL-3.0

kandi X-RAY | browsertrix-crawler Summary

browsertrix-crawler is a JavaScript library. It has no reported vulnerabilities, a Strong Copyleft license, and low support; however, it has 4 bugs. You can download it from GitHub.

Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses puppeteer-cluster and puppeteer to control one or more browsers in parallel.

            Support

              browsertrix-crawler has a low active ecosystem.
              It has 334 star(s) with 41 fork(s). There are 24 watchers for this library.
              There were 4 major release(s) in the last 12 months.
              There are 60 open issues and 114 have been closed. On average issues are closed in 31 days. There are 4 open pull requests and 0 closed requests.
              It has a neutral sentiment in the developer community.
              The latest version of browsertrix-crawler is 0.10.1

            Quality

              browsertrix-crawler has 4 bugs (0 blocker, 0 critical, 3 major, 1 minor) and 0 code smells.

            Security

              browsertrix-crawler has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              browsertrix-crawler code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            License

              browsertrix-crawler is licensed under the AGPL-3.0 License. This license is Strong Copyleft.
              Strong Copyleft licenses enforce sharing, and you can use them when creating open source projects.

            Reuse

              browsertrix-crawler releases are available to install and integrate.
              Installation instructions, examples and code snippets are available.
              It has 105 lines of code, 0 functions and 27 files.
              It has low code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed browsertrix-crawler and discovered the below as its top functions. This is intended to give you an instant insight into browsertrix-crawler implemented functionality, and help decide if they suit your requirements.
            • Main application
            • Generates CLI options
            • Prompt for user input
            • Determine if request should be aborted
            • Create profile page
            • Initialize storage
            • Finalize the page
            • Get the default browser version
            • Gets the browser environment from the browser
            • Generate a checksum for a file

            browsertrix-crawler Key Features

            No Key Features are available at this moment for browsertrix-crawler.

            browsertrix-crawler Examples and Code Snippets

            No Code Snippets are available at this moment for browsertrix-crawler.

            Community Discussions

            No Community Discussions are available at this moment for browsertrix-crawler. Refer to the Stack Overflow page for discussions.

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install browsertrix-crawler

            Browsertrix Crawler requires Docker to be installed on the machine running the crawl. Assuming Docker is installed, you can run a crawl and test your archive with the following steps. You don't even need to clone this repo. Just choose a directory where you'd like the crawl data to be placed, and then run the following commands, replacing [URL] with the web site you'd like to crawl.
            1. Run docker pull webrecorder/browsertrix-crawler
            2. Run docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [URL] --generateWACZ --text --collection test
            3. The crawl will now run, and its progress will be output to the console. Depending on the size of the site, this may take a while.
            4. Once the crawl is finished, a WACZ file will be created in crawls/collection/test/test.wacz, relative to the directory from which you ran the crawl.
            5. Go to ReplayWeb.page, open the generated WACZ file, and browse your newly crawled archive.
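The steps above can be wrapped in a small script. This is a minimal sketch, not part of the project: the variable names URL, COLLECTION, and RUN are illustrative assumptions, and the output path simply restates the one given in the steps. By default the script only prints the command, so it can be inspected before anything is pulled or crawled.

```shell
#!/bin/sh
set -eu

# Illustrative variables (assumptions, not crawler settings).
URL="${URL:-https://example.com/}"
COLLECTION="${COLLECTION:-test}"

# Where the steps above say the WACZ file appears, relative to the current directory.
wacz="crawls/collection/$COLLECTION/$COLLECTION.wacz"

# Keep the command in the positional parameters so each argument stays intact.
set -- docker run -v "$PWD/crawls:/crawls/" -it webrecorder/browsertrix-crawler \
  crawl --url "$URL" --generateWACZ --text --collection "$COLLECTION"

if [ "${RUN:-0}" = "1" ]; then
  docker pull webrecorder/browsertrix-crawler
  "$@"
  ls -l "$wacz"
else
  # Dry run: show what would be executed.
  echo "$*"
  echo "expected output: $wacz"
fi
```

Running it with RUN=1 and URL set to the site of interest performs the pull and the crawl; without RUN=1 nothing is executed.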
            To include automated text extraction for full text search, add the --text flag.
            To limit the crawl to a maximum number of pages, add --limit P where P is the number of pages that will be crawled.
            To run more than one browser worker and crawl in parallel, add --workers N where N is the number of browsers to run in parallel. More browsers will require more CPU and network bandwidth, and do not guarantee faster crawling.
            To crawl into a new directory, specify a different name for the --collection parameter. If it is omitted, a new collection directory named after the current time will be created.
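The optional flags described above can be combined as in the following sketch. The variable names LIMIT, WORKERS, and COLLECTION and their default values are assumptions chosen for illustration; only the flags themselves (--text, --limit, --workers, --collection) come from the text. The script prints the resulting command rather than running it.

```shell
#!/bin/sh
set -eu

# Illustrative variables (assumptions); override them in the environment.
URL="${URL-https://example.com/}"
LIMIT="${LIMIT-50}"            # maximum pages; set LIMIT= for no limit
WORKERS="${WORKERS-2}"         # parallel browsers; set WORKERS= to omit the flag
COLLECTION="${COLLECTION-}"    # empty: a time-based collection directory is created

# Collect the crawl arguments in the positional parameters.
set -- crawl --url "$URL" --generateWACZ --text
if [ -n "$LIMIT" ]; then set -- "$@" --limit "$LIMIT"; fi
if [ -n "$WORKERS" ]; then set -- "$@" --workers "$WORKERS"; fi
if [ -n "$COLLECTION" ]; then set -- "$@" --collection "$COLLECTION"; fi

# Print the full command; remove "echo" to actually start the crawl.
echo docker run -v "$PWD/crawls:/crawls/" -it webrecorder/browsertrix-crawler "$@"
```

Building the argument list step by step keeps each flag optional, so the same script covers both a quick limited crawl and a full one.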
            Browsertrix Crawler uses a browser image which supports amd64 and arm64 (currently oldwebtoday/chrome:91). This means Browsertrix Crawler can be built natively on Apple M1 systems using the default settings. Running docker-compose build on an Apple M1 should produce a native version that works for development. On an M1 system, the browser used will be Chromium instead of Chrome, since there is no Linux build of Chrome for ARM; this is now handled automatically as part of the build. Note that Chromium is different from Chrome. For example, some video codecs supported in the amd64 / Chrome version may not be supported in the ARM / Chromium-based version. For production crawling, it is recommended to run on an amd64 Linux environment.
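To check which variant a given machine is likely to get, the host architecture can be inspected before building. This is a rough sketch: browser_for_arch is a hypothetical helper that only restates the mapping described above, and the actual browser selection is done by the build itself, not by this script.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: maps a machine architecture name to the browser
# the paragraph above says is used on that platform.
browser_for_arch() {
  case "$1" in
    x86_64|amd64)  echo "Chrome" ;;
    arm64|aarch64) echo "Chromium" ;;
    *)             echo "unknown" ;;
  esac
}

arch="$(uname -m)"
echo "host architecture: $arch, expected browser: $(browser_for_arch "$arch")"
```

Note that uname -m reports the host architecture; a container run under emulation may report something different.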




            Try Top Libraries by webrecorder

            pywb

            by webrecorder | JavaScript

            archiveweb.page

            by webrecorder | JavaScript

            replayweb.page

            by webrecorder | JavaScript

            webrecorder-player

            by webrecorder | JavaScript

            warcio

            by webrecorder | Python