browsertrix-crawler | high-fidelity browser-based crawler
kandi X-RAY | browsertrix-crawler Summary
browsertrix-crawler is a JavaScript library. It has no reported vulnerabilities, carries a Strong Copyleft license, and has low support. However, browsertrix-crawler has 4 known bugs. You can download it from GitHub.
Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses puppeteer-cluster and puppeteer to control one or more browsers in parallel.
Support
browsertrix-crawler has a low-activity ecosystem.
It has 334 stars, 41 forks, and 24 watchers.
It had no major release in the last 12 months.
There are 60 open issues and 114 have been closed. On average, issues are closed in 31 days. There are 4 open pull requests and 0 closed pull requests.
It has a neutral sentiment in the developer community.
The latest version of browsertrix-crawler is 0.10.1.
Quality
browsertrix-crawler has 4 bugs (0 blocker, 0 critical, 3 major, 1 minor) and 0 code smells.
Security
browsertrix-crawler has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
browsertrix-crawler code analysis shows 0 unresolved vulnerabilities.
There are 0 security hotspots that need review.
License
browsertrix-crawler is licensed under the AGPL-3.0 License. This license is Strong Copyleft.
Strong Copyleft licenses enforce sharing, and you can use them when creating open source projects.
Reuse
browsertrix-crawler releases are available to install and integrate.
Installation instructions, examples and code snippets are available.
It has 105 lines of code, 0 functions, and 27 files.
It has low code complexity; code complexity directly impacts the maintainability of the code.
Top functions reviewed by kandi - BETA
kandi has reviewed browsertrix-crawler and identified the following as its top functions. This is intended to give you an instant insight into the functionality browsertrix-crawler implements, and to help you decide if it suits your requirements.
- Main application.
- Generate CLI options.
- Prompt for user input.
- Determine if a request should be aborted.
- Create the profile page.
- Initialize storage.
- Finalize the page.
- Get the default browser version.
- Get the browser environment.
- Generate a checksum for a file.
browsertrix-crawler Key Features
No Key Features are available at this moment for browsertrix-crawler.
browsertrix-crawler Examples and Code Snippets
No Code Snippets are available at this moment for browsertrix-crawler.
Community Discussions
No Community Discussions are available at this moment for browsertrix-crawler. Refer to the Stack Overflow page for discussions.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install browsertrix-crawler
Browsertrix Crawler requires Docker to be installed on the machine running the crawl. Assuming Docker is installed, you can run a crawl and test your archive with the following steps. You don't even need to clone this repo; just choose a directory where you'd like the crawl data to be placed, then run the following commands, replacing [URL] with the website you'd like to crawl.
Run docker pull webrecorder/browsertrix-crawler
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [URL] --generateWACZ --text --collection test
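For example, a complete invocation might look like this (https://example.com/ is a stand-in for your own target site):
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://example.com/ --generateWACZ --text --collection test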
The crawl will now run and progress of the crawl will be output to the console. Depending on the size of the site, this may take a bit!
Once the crawl is finished, a WACZ file will be created in crawls/collection/test/test.wacz, relative to the directory where you ran the crawl!
You can go to ReplayWeb.page and open the generated WACZ file and browse your newly crawled archive!
To include automated text extraction for full text search, add the --text flag.
To limit the crawl to a maximum number of pages, add --limit P where P is the number of pages that will be crawled.
To run more than one browser worker and crawl in parallel, add --workers N where N is the number of browsers to run in parallel. More browsers will require more CPU and network bandwidth, and do not guarantee faster crawling.
To crawl into a new directory, specify a different name for the --collection param; if omitted, a new collection directory named after the current time will be created. These flags can be combined, as in the sketch below.
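A minimal sketch combining these flags (the URL, page limit, worker count, and collection name are illustrative values, not requirements):
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://example.com/ --generateWACZ --text --limit 100 --workers 4 --collection my-crawl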
Browsertrix Crawler uses a browser image which supports amd64 and arm64 (currently oldwebtoday/chrome:91). This means Browsertrix Crawler can be built natively on Apple M1 systems using the default settings. Simply running docker-compose build on an Apple M1 should build a native version that should work for development. On M1 systems, the browser used will be Chromium instead of Chrome, since there is no Linux build of Chrome for ARM; this is now handled automatically as part of the build. Note that Chromium is different from Chrome: for example, some video codecs supported in the amd64 / Chrome version may not be supported in the ARM / Chromium version. For production crawling, it is recommended to run on an amd64 Linux environment.
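For instance, a native development build on an M1 machine could look like this (a sketch assuming the repository lives at its usual GitHub location and Docker Compose is installed):
git clone https://github.com/webrecorder/browsertrix-crawler
cd browsertrix-crawler
docker-compose build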
Support
Find more information at the project's GitHub repository: https://github.com/webrecorder/browsertrix-crawler