pywb | Core Python Web Archiving Toolkit for replay and recording | Continuous Backup library
kandi X-RAY | pywb Summary
Core Python Web Archiving Toolkit for replay and recording of web archives
Top functions reviewed by kandi - BETA
- Load a resource
- Replace regex match
- Safely close the connection
- Get request body
- Main entry point
- Write multiple cdx index files
- Process acl command
- Return the CDX writer class
- Remove extension from filename
- Return headers as a dict
- Load data from a file
- Rewrite stream
- Handle request
- Write multiple cdx files
- Put custom record
- Fill content type and charset
- Process acl
- Serve the cdx query
- Handle timegate request
- Handles proxy fetch
- Load a WARC resource
- Rewrite a cookie
- Load a CDX resource
- Serve a collection page
- Rewrite dash string
- Parse a Fuzzy_lookup rule
- Serve static files
pywb Examples and Code Snippets
from werkzeug.routing import Map, Rule
from werkzeug.exceptions import HTTPException
my_url_map = Map([
# Note that this rule builds on the `/cat_articles` prefix used in the `DispatcherMiddleware` call further down
Rule('/', endp
from pywb.utils.binsearch import iter_range
from pywb.utils.wbexception import NotFoundException
from pywb.warcserver.index.cdxobject import CDXObject
from pywb.utils.format import res_template
def modified_load_index(self, params):
Community Discussions
Trending Discussions on pywb
QUESTION
I'm trying to use a regex in the filter parameter but I can't use the $ to determine the end of a string:
My request URL:
http://index.commoncrawl.org/CC-MAIN-2017-39-index?url=*.com/&matchtype=domain&fl=url&filter=~url:.com/$
- My filter uses the ~ prefix, which makes it a regex. It validates correctly on a Python regex tester (https://pythex.org/) for any .com URL that is just the TLD, e.g. https://stackoverflow.com/
API documentation: https://github.com/ikreymer/pywb/wiki/CDX-Server-API#api-reference
I'm basically getting a lot of results with pages on each website which I don't care about, I just want the TLD.
If I take the $ out it works.
ANSWER
Answered 2017-Oct-11 at 10:11
This query should work:
http://index.commoncrawl.org/CC-MAIN-2017-39-index?url=*.com/&fl=url&filter=url:.*\.com/$
But in the future you may have to use:
http://index.commoncrawl.org/CC-MAIN-2017-39-index?url=*.com/&fl=url&filter=~url:.*\.com/$
- There is a known bug in pywb (#249). The fix will hopefully be deployed to index.commoncrawl.org soon. As a temporary workaround, use = for regex filters and =~ for "contains" filters.
- matchType=domain is not required here, as the URL is already matched by a wildcard pattern (*.com/). It is meant for querying domain names, e.g. http://index.commoncrawl.org/CC-MAIN-2017-39-index?url=commoncrawl.org&matchType=domain&fl=url
- The regex is matched from the beginning of the field value, so it should be .*\.com/$. See the improved documentation in pywb#250.
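The point about matching from the beginning of the field value can be illustrated locally. The following is a minimal sketch that assumes the index server behaves like Python's re.match (anchored at the start of the string); the answer only states that matching starts at the beginning of the field, so the exact server-side call is an assumption, and matches_from_start is a hypothetical helper name.

```python
import re


def matches_from_start(pattern, value):
    """Return True if `pattern` matches `value` starting at position 0.

    Assumption: this mirrors how the CDX server applies a regex filter.
    """
    return re.match(pattern, value) is not None


url = "https://stackoverflow.com/"

# The original filter fails when anchored at the start of the field,
# because "." followed by "com/" cannot match "https://...".
print(matches_from_start(r".com/$", url))                   # False

# A tester that searches anywhere in the string still reports a hit.
print(re.search(r".com/$", url) is not None)                # True

# Prefixing ".*" lets the pattern consume the start of the URL.
print(matches_from_start(r".*\.com/$", url))                # True

# URLs with a path after ".com/" are rejected by the "$" anchor.
print(matches_from_start(r".*\.com/$", url + "questions"))  # False
```

This would also explain why the pattern looked correct in an online tester, if that tester searches anywhere in the string, while returning nothing from the index.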
QUESTION
I would like to download a subset of a WAT archive segment from Amazon S3.
Background:
Searching the Common Crawl index at http://index.commoncrawl.org yields results with information about the location of WARC files on AWS S3. For example, searching for url=www.celebuzz.com/2017-01-04/*&output=json yields JSON-formatted results, one of which is
{
"urlkey":"com,celebuzz)/2017-01-04/watch-james-corden-george-michael-tribute",
...
"filename":"crawl-data/CC-MAIN-2017-34/segments/1502886104631.25/warc/CC-MAIN-20170818082911-20170818102911-00023.warc.gz",
...
"offset":"504411150",
"length":"14169",
...
}
The filename entry indicates which archive segment contains the WARC file for this particular page. This archive file is huge; but fortunately the entry also contains offset and length fields, which can be used to request the range of bytes containing the relevant subset of the archive segment (see, e.g., lines 22-30 in this gist).
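As a rough illustration of how offset and length translate into a byte-range request, here is a minimal sketch using only the Python standard library. It is not the code from the gist mentioned above. The base URL https://data.commoncrawl.org/ and the helper names byte_range_header and fetch_record are assumptions introduced for this example.

```python
import gzip
import urllib.request

# Assumption: public HTTP endpoint for Common Crawl data (not named in the text above).
BASE_URL = "https://data.commoncrawl.org/"


def byte_range_header(offset, length):
    """Build an HTTP Range header value for `length` bytes starting at `offset`.

    The index returns both fields as strings, so they are converted first.
    HTTP byte ranges are inclusive, hence the "- 1" on the end position.
    """
    start = int(offset)
    end = start + int(length) - 1
    return f"bytes={start}-{end}"


def fetch_record(filename, offset, length):
    """Download one record from an archive segment and return it decompressed.

    Assumption: each record is stored as its own gzip member, so the
    requested slice can be decompressed independently of the rest of the file.
    """
    request = urllib.request.Request(
        BASE_URL + filename,
        headers={"Range": byte_range_header(offset, length)},
    )
    with urllib.request.urlopen(request) as response:
        compressed = response.read()
    return gzip.decompress(compressed)


# Values taken from the index result shown above.
print(byte_range_header("504411150", "14169"))  # bytes=504411150-504425318
```

Calling fetch_record with the filename, offset, and length from an index result would perform the actual download; it is left uncalled here because it requires network access.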
My question:
Given the location of a WARC file segment, I know how to construct the name of the corresponding WAT archive segment (see, e.g., this tutorial). I only need a subset of the WAT file, so I would like to request a range of bytes. But how do I find the corresponding offset and length for the WAT archive segment?
I have checked the API documentation for the Common Crawl index server, and it isn't clear to me that this is even possible. But in case it is, I'm posting this question.
ANSWER
Answered 2017-Sep-06 at 08:16
After much trial and error, I managed to get a range from a WARC file in Python with boto3 in the following way:
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported