pywb | Core Python Web Archiving Toolkit for replay and recording | Continuous Backup library

by webrecorder | Python | Version: 2.8.3 | License: GPL-3.0

kandi X-RAY | pywb Summary

pywb is a Python library typically used in Backup Recovery and Continuous Backup applications. kandi reports no bugs and no vulnerabilities for pywb; it has a Strong Copyleft license and medium support. You can install it with 'pip install pywb' or download it from GitHub or PyPI.

Core Python Web Archiving Toolkit for replay and recording of web archives

Support

pywb has a medium active ecosystem.

It has 1146 star(s) with 193 fork(s). There are 59 watchers for this library.

It had no major release in the last 12 months.

There are 131 open issues and 308 have been closed. On average issues are closed in 41 days. There are 8 open pull requests and 0 closed requests.

It has a neutral sentiment in the developer community.

The latest version of pywb is 2.8.3

Quality

pywb has 0 bugs and 0 code smells.

Security

pywb has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

pywb code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

pywb is licensed under the GPL-3.0 License. This license is Strong Copyleft.

Strong Copyleft licenses enforce sharing, and you can use them when creating open source projects.

Reuse

pywb releases are available to install and integrate.

Deployable package is available in PyPI.

Top functions reviewed by kandi - BETA

kandi has reviewed pywb and identified the functions below as its top functions. The list is intended to give you quick insight into the functionality pywb implements and help you decide whether it suits your requirements.

Load a resource
Replace regex match
Safely close the connection
Get request body
Main entry point
Write multiple cdx index files
Process acl command
Return the CDX writer class
Remove extension from filename
Return headers as a dict
Load data from a file
Rewrite stream
Handle request
Write multiple cdx files
Put custom record
Fill content type and charset
Process acl
Serve the cdx query
Handle timegate request
Handles proxy fetch
Load a WARC resource
Rewrite a cookie
Load a CDX resource
Serve a collection page
Rewrite dash string
Parse a Fuzzy_lookup rule
Serve static files

pywb Key Features

No Key Features are available at this moment for pywb.

pywb Examples and Code Snippets

How to include a variable in URL path using wsgi? (not query string)

Python

Lines of Code : 38

License : Strong Copyleft (CC BY-SA 4.0)

from werkzeug.routing import Map, Rule
from werkzeug.exceptions import HTTPException

my_url_map = Map([
    # Note that this rule builds on the `/cat_articles` prefix used in the `DispatcherMiddleware` call further down
    Rule('/', endp

Rewrite functions in pywb without code source changing

Python

Lines of Code : 26

License : Strong Copyleft (CC BY-SA 4.0)

from pywb.utils.binsearch import iter_range
from pywb.utils.wbexception import NotFoundException
from pywb.warcserver.index.cdxobject import CDXObject
from pywb.utils.format import res_template

def modified_load_index(self, params):

Community Discussions

Trending Discussions on pywb

RegEx on CommonCrawl API filter parameter

Get offset and length of a subset of a WAT archive from Common Crawl index server

QUESTION

RegEx on CommonCrawl API filter parameter

Asked 2017-Oct-11 at 10:11

I'm trying to use a regex in the filter parameter but I can't use the $ to determine the end of a string:

My request URL:

http://index.commoncrawl.org/CC-MAIN-2017-39-index?url=*.com/&matchtype=domain&fl=url&filter=~url:.com/$

My filter uses the ~ prefix, which makes it a regex.
It validates correctly on a Python regex tester (https://pythex.org/) for any .com URL that is just the TLD, e.g. https://stackoverflow.com/

API documentation: https://github.com/ikreymer/pywb/wiki/CDX-Server-API#api-reference

I'm basically getting a lot of results with pages on each website which I don't care about, I just want the TLD. If I take the $ out it works.

...

ANSWER

Answered 2017-Oct-11 at 10:11

This query should work: http://index.commoncrawl.org/CC-MAIN-2017-39-index?url=*.com/&fl=url&filter=url:.*\.com/$

But in the future you may have to use http://index.commoncrawl.org/CC-MAIN-2017-39-index?url=*.com/&fl=url&filter=~url:.*\.com/$

There is a known bug in pywb (#249). Hopefully a fix will be deployed to index.commoncrawl.org soon. As a temporary workaround, use = for regex filters and =~ for "contains" filters.
matchType=domain is not required here because the URL is already matched by a wildcard pattern, *.com/. It is meant for querying domain names, e.g. http://index.commoncrawl.org/CC-MAIN-2017-39-index?url=commoncrawl.org&matchType=domain&fl=url.
The regex is matched from the beginning of the field value, so it should be .*\.com/$. See the improved documentation in pywb#250.
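
The point about matching from the beginning of the field value can be illustrated with Python's own `re` module. This is a minimal sketch that assumes the index server applies the filter the way `re.match` does (anchored at the start of the value), as the answer describes. It does not call pywb or the Common Crawl API, and the helper name `filter_matches` is hypothetical.

```python
import re


def filter_matches(pattern: str, field_value: str) -> bool:
    """Return True if `pattern` matches starting at the beginning of `field_value`.

    Assumption: this mirrors the start-anchored matching the answer describes;
    it is not pywb's actual implementation.
    """
    return re.match(pattern, field_value) is not None


homepage = "https://stackoverflow.com/"
subpage = "https://stackoverflow.com/questions/46672538"

# The question's pattern cannot match at position 0 of a full URL.
print(filter_matches(r".com/$", homepage))
# Prefixing `.*` lets the pattern consume the start of the URL first.
print(filter_matches(r".*\.com/$", homepage))
# The trailing `$` still excludes deeper pages.
print(filter_matches(r".*\.com/$", subpage))
```

Under that assumption, only the second call reports a match, which is consistent with the question's observation that the original filter stops working once `$` is added.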

Source https://stackoverflow.com/questions/46672538

QUESTION

Get offset and length of a subset of a WAT archive from Common Crawl index server

Asked 2017-Sep-11 at 09:51

I would like to download a subset of a WAT archive segment from Amazon S3.

Background:

Searching the Common Crawl index at http://index.commoncrawl.org yields results with information about the location of WARC files on AWS S3. For example, searching for url=www.celebuzz.com/2017-01-04/*&output=json yields JSON-formatted results, one of which is

{ "urlkey":"com,celebuzz)/2017-01-04/watch-james-corden-george-michael-tribute", ... "filename":"crawl-data/CC-MAIN-2017-34/segments/1502886104631.25/warc/CC-MAIN-20170818082911-20170818102911-00023.warc.gz", ... "offset":"504411150", "length":"14169", ... }

The filename entry indicates which archive segment contains the WARC file for this particular page. This archive file is huge; but fortunately the entry also contains offset and length fields, which can be used to request the range of bytes containing the relevant subset of the archive segment (see, e.g., lines 22-30 in this gist).
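
The byte range described above can be computed from the `offset` and `length` fields. The following is a minimal sketch using only the Python standard library. The helper name `byte_range_header` and the base URL `https://data.commoncrawl.org/` are assumptions that do not come from the original post, and the request object is only constructed, not sent.

```python
import urllib.request


def byte_range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value covering `length` bytes from `offset`.

    HTTP byte ranges are inclusive on both ends, hence the `- 1`.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


# Values copied from the index result quoted above (the index returns strings).
record = {
    "filename": "crawl-data/CC-MAIN-2017-34/segments/1502886104631.25/warc/"
                "CC-MAIN-20170818082911-20170818102911-00023.warc.gz",
    "offset": "504411150",
    "length": "14169",
}

range_value = byte_range_header(int(record["offset"]), int(record["length"]))

# Assumption: the archive files are served over HTTP under this base URL.
BASE_URL = "https://data.commoncrawl.org/"
request = urllib.request.Request(
    BASE_URL + record["filename"],
    headers={"Range": range_value},
)
print(range_value)
```

Passing `request` to `urllib.request.urlopen` would then fetch only that slice of the archive, provided the server honors range requests.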

My question:

Given the location of a WARC file segment, I know how to construct the name of the corresponding WAT archive segment (see, e.g., this tutorial). I only need a subset of the WAT file, so I would like to request a range of bytes. But how do I find the corresponding offset and length for the WAT archive segment?
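
For reference, the WARC-to-WAT name mapping mentioned above is commonly described as replacing the `/warc/` path component with `/wat/` and the `.warc.gz` suffix with `.warc.wat.gz`. The sketch below assumes that convention holds for the crawl in question; the helper name `wat_path_for` is hypothetical. It only derives the file name and says nothing about offsets within the WAT file.

```python
def wat_path_for(warc_path: str) -> str:
    """Derive a WAT file path from a WARC file path.

    Assumption: the WAT file lives in a sibling `wat/` directory and uses
    the `.warc.wat.gz` suffix; verify this against the crawl's file listing.
    """
    suffix = ".warc.gz"
    if "/warc/" not in warc_path or not warc_path.endswith(suffix):
        raise ValueError(f"not a recognised WARC path: {warc_path!r}")
    # Split on the last `/warc/` so earlier path components are untouched.
    head, _, tail = warc_path.rpartition("/warc/")
    return f"{head}/wat/{tail[:-len(suffix)]}.warc.wat.gz"


warc = (
    "crawl-data/CC-MAIN-2017-34/segments/1502886104631.25/warc/"
    "CC-MAIN-20170818082911-20170818102911-00023.warc.gz"
)
print(wat_path_for(warc))
```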

I have checked the API documentation for the Common Crawl index server, and it isn't clear to me that this is even possible. But in case it is, I'm posting this question.

...

ANSWER

Answered 2017-Sep-06 at 08:16

After much trial and error, I managed to get a range from a WARC file using Python and boto3 in the following way:

Source https://stackoverflow.com/questions/45920527

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install pywb

You can install it with 'pip install pywb' or download it from GitHub or PyPI.

Support

For new features, suggestions, and bugs, create an issue on GitHub. If you have questions, search for or ask them on the Stack Overflow community page.
