warc | Web archiver to bundle web pages
kandi X-RAY | warc Summary
[Go Report Card] This project is now archived. If you want to archive web pages, consider checking out [obelisk] instead: it has a better output format (plain HTML) and is, IMHO, better written than this one.

WARC is a Go package that archives a web page and its resources into a single [bolt] database file. It was developed as part of the [Shiori] bookmarks manager. It is still in the development phase but should be stable enough to use, and the bolt database used by this project is stable in both API and file format. Unfortunately, WARC currently disables JavaScript when archiving a page, so it still doesn't work on SPA sites like Twitter or Reddit.
Top functions reviewed by kandi - BETA
- processJS processes a JS resource
- ProcessHTMLFile processes an HTML file
- createResource creates and returns a Resource
- processCSS processes a CSS file
- processMediaTag processes a media tag
- fixLazyImages fixes lazy images in the given DOM
- NewArchive starts a new archival database
- Read returns the content of the archive
- disableXHR disables XMLHttpRequest
- processMetaTag processes a meta tag
Community Discussions
Trending Discussions on warc
QUESTION
I can obtain a listing for Common Crawl with:
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-09/wet.paths.gz
How can I do this with the Common Crawl News Dataset?
I tried different options, but I always get errors:
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS-2017-09/warc.paths.gz
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2017/09/warc.paths.gz
...ANSWER
Answered 2021-Mar-21 at 15:34
Since a new WARC file is added to the news dataset every few hours, a static file list does not make sense. Instead, you can get the list of files using the AWS CLI, for any subset by year or month, e.g.
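As a small illustration of that listing step, here is a stdlib-only helper that builds the month prefix and the corresponding anonymous AWS CLI command. The bucket layout is taken from the URLs in the question; the exact CLI invocation is an assumption, not part of the original answer:

```python
# Build the CC-NEWS S3 prefix for a year/month, plus the matching AWS CLI
# listing command. The bucket layout follows the URLs in the question; the
# exact CLI invocation is an assumption, not part of the original answer.

def news_prefix(year: int, month: int) -> str:
    """Key prefix for one month of CC-NEWS, e.g. crawl-data/CC-NEWS/2017/09/."""
    return f"crawl-data/CC-NEWS/{year}/{month:02d}/"

def list_command(year: int, month: int) -> str:
    """AWS CLI command that lists that month's WARC files anonymously."""
    return f"aws s3 ls --no-sign-request s3://commoncrawl/{news_prefix(year, month)}"

print(list_command(2017, 9))
# aws s3 ls --no-sign-request s3://commoncrawl/crawl-data/CC-NEWS/2017/09/
```

The `--no-sign-request` flag makes the CLI work without AWS credentials, which is how the public Common Crawl bucket is normally read.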
QUESTION
I am currently parsing WARC files from the Common Crawl corpus, and I would like to know upfront, without iterating through all the WARC records, how many records there are.
Does the WARC 1.1 standard define such information?
...ANSWER
Answered 2021-Jan-24 at 12:59
The WARC standard does not define a standard way to indicate the number of WARC records in the WARC file itself. The number of response records in Common Crawl WARC files is usually between 30,000 and 50,000; note that there are also request and metadata records. The WARC standard recommends 1 GB as the target size of WARC files, which puts a natural limit on the number of records.
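The count also can't be read cheaply from the file header, but Common Crawl writes each record as its own gzip member, so counting members counts records without parsing any WARC headers. A stdlib sketch of that idea (the member-per-record layout is a Common Crawl convention, not something the WARC standard guarantees):

```python
import gzip
import zlib

def count_gzip_members(data: bytes) -> int:
    """Count gzip members in concatenated-gzip data. For Common Crawl
    .warc.gz files, one member == one WARC record (request, response
    and metadata records combined)."""
    members = 0
    while data:
        d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # expect gzip wrapper
        d.decompress(data)
        if not d.eof:            # truncated final member: stop counting
            break
        members += 1
        data = d.unused_data     # whatever follows this member
    return members

blob = gzip.compress(b"record 1") + gzip.compress(b"record 2")
print(count_gzip_members(blob))  # 2
```

For a real 1 GB file you would feed the data in chunks rather than decompress whole members into memory, but the member-counting logic stays the same.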
QUESTION
Half of the buffer used with ReadFile is corrupt. Regardless of the size of the buffer, half of it contains the same corrupted character. I have looked for anything that could be causing the read to stop early, etc. If I increase the size of the buffer, I see more of the file, so it is not failing on a particular part of the file.
Visual Studio 2019. Windows 10.
...ANSWER
Answered 2020-Dec-05 at 03:16
The reason is that you use a buffer array of type TCHAR, and the TCHAR type is 2 bytes wide. The bufferSize you pass to ReadFile is a byte count, so the bytes read fill only one buffer element per 2 bytes. But the actual size of the buffer is sizeof(TCHAR) * fileSize bytes, so half of the buffer array you see is "corrupted".
QUESTION
Hi, I'm working on a fun project with the Common Crawl data. I have a subset of the most recent crawl's WARC file paths from here,
so basically I have a URL like https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-45/segments/1603107863364.0/warc/CC-MAIN-20201019145901-20201019175901-00000.warc.gz (the first URL in the WARC paths) and I am streaming in the request like so:
...ANSWER
Answered 2020-Dec-01 at 08:02
Why not use a WARC parser library (I'd recommend warcio) to do the parsing, including gzip decompression?
Alternatively, have a look at gzipstream to read from a stream of gzipped content and decompress the data on the fly.
QUESTION
Processing Common Crawl WARC files. These are 5 GB uncompressed and contain text, XML, and WARC headers.
This is the code I am particularly having trouble with:
...ANSWER
Answered 2020-Nov-20 at 09:51
"Which gives me the error, 'expression must have a pointer to class type'."
TCHAR is not a class type, so it has no such method as substr.
modify:
QUESTION
Assuming I have:
- the link of the CC*.warc file (and the file itself, if it helps);
- offset; and
- length
How can I get the HTML content of that page?
Thanks for your time and attention.
...ANSWER
Answered 2020-Oct-26 at 08:48
Using warcio it would simply be:
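The warcio call itself is not shown above. Without any dependencies, the same fetch can be sketched with the stdlib; the Range arithmetic and the two header splits assume the usual Common Crawl layout (a gzipped response record whose payload is an HTTP response), so treat this as a sketch rather than the original answer's code:

```python
import gzip
import urllib.request

def extract_html(record: bytes) -> bytes:
    """Drop the WARC header block, then the HTTP header block; each is
    terminated by a blank CRLF line. What remains is the page body
    (possibly followed by the record's trailing CRLFs)."""
    _, _, rest = record.partition(b"\r\n\r\n")   # strip WARC headers
    _, _, body = rest.partition(b"\r\n\r\n")     # strip HTTP headers
    return body

def fetch_record_html(warc_url: str, offset: int, length: int) -> bytes:
    """Range-request exactly one record's bytes and gunzip them."""
    req = urllib.request.Request(
        warc_url,
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_html(gzip.decompress(resp.read()))
```

The Range header asks for `length` bytes starting at `offset`, which is exactly one gzip member (one record) in a Common Crawl .warc.gz file.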
QUESTION
I'm very new to Python and I'm having trouble working on an assignment which basically is like this:
#Read line by line a WARC file to identify string1.
#When string1 found, add part of the string as a key to a dictionary.
#Then continue reading file to identify string2, and add part of string2 as a value to the previous key.
#Keep going through file and doing the same to build the dictionary.
I can't import anything, which is causing me some trouble, especially adding the key, leaving its value empty, and then continuing through the file to find string2 to use as the value.
I've started thinking something like saving the key to an intermediate variable, then going on to identify the value, add to an intermediate variable and finally build the dictionary.
...ANSWER
Answered 2020-Sep-30 at 13:45
Your idea of storing the key in an intermediate variable is good.
I also suggest using the following snippet to iterate over the lines.
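The snippet itself is not shown above; here is a minimal, import-free sketch of the key-then-value scan. The markers string1/string2 and the "part after the marker" extraction rule are placeholders for whatever the assignment actually specifies:

```python
def build_dict(lines, string1, string2):
    """Scan lines in order: a line containing string1 yields a pending
    key; the next line containing string2 supplies that key's value."""
    result = {}
    pending_key = None
    for line in lines:
        if string1 in line:
            # take the part after the marker as the key (assumed rule)
            pending_key = line.split(string1, 1)[1].strip()
            result[pending_key] = None        # value not yet known
        elif string2 in line and pending_key is not None:
            result[pending_key] = line.split(string2, 1)[1].strip()
            pending_key = None
    return result

lines = [
    "WARC-Target-URI: http://a.example",
    "Content-Length: 10",
]
print(build_dict(lines, "WARC-Target-URI:", "Content-Length:"))
# {'http://a.example': '10'}
```

Storing `None` as a placeholder value keeps keys visible even if their value line never shows up, which makes missing data easy to spot afterwards.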
QUESTION
I am downloading Common Crawl files from AWS. Apparently, they are large concatenated .gz files, which is supported by the gzip standard. I am using zlib to inflate them, but I only get the decompressed contents of the file up to the first concatenation point. I have tried adding inflateReset(), but then I get error -5, which indicates a buffer or file problem. I suspect I am close.
Here's the code without inflateReset(). It works fine on non-concatenated files.
...ANSWER
Answered 2020-Sep-23 at 23:15
Does your compiler not at least warn you about the naked conditional ret == Z_STREAM_END;? You want an if there, and some braces around the inflateReset()-related statements.

There's still a problem in that you leave the outer loop whenever strm.avail_in is zero. That will happen every time, except when reaching the end of a member, and it can even happen then if you just so happen to exhaust the input buffer decompressing that member. Just make the outer loop a while (true).

Even after fixing all that, you would then discard the remaining available input when you do the read at the top of the outer loop. Only do that read if strm.avail_in is zero.
A simpler approach would be to do the reset in the inner loop. Like this (example in C):
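The C code itself is not reproduced above. As a rough Python equivalent of the same logic (a fresh inflate state per member, refilling the input only once it is fully consumed), with zlib.decompressobj standing in for the z_stream:

```python
import gzip
import io
import zlib

def gunzip_members(read, chunk_size=16384):
    """Decompress concatenated gzip members from a read(n) callable,
    mirroring the fix described above: reset per member, and only
    refill the input buffer once it has been fully consumed."""
    out = bytearray()
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    buf = b""
    while True:
        if not buf:                  # read only when input is exhausted
            buf = read(chunk_size)
            if not buf:
                break                # true end of input
        out += d.decompress(buf)
        if d.eof:                    # Z_STREAM_END: one member finished
            buf = d.unused_data      # keep the leftover input
            d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # "inflateReset"
        else:
            buf = b""                # input consumed, member continues
    return bytes(out)

blob = gzip.compress(b"x" * 500) + gzip.compress(b"y" * 500)
print(gunzip_members(io.BytesIO(blob).read) == b"x" * 500 + b"y" * 500)  # True
```

Here `d.unused_data` plays the role of the input bytes left over after Z_STREAM_END, which in the C version must not be discarded by an unconditional read.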
QUESTION
I have successfully crawled a website using Nutch and now I want to create a WARC from the results. However, running both the warc and commoncrawldump commands fails. Also, running bin/nutch dump -segment ...
works successfully on the same segment folder.
I am using Nutch v1.17 and running:
...ANSWER
Answered 2020-Sep-15 at 12:59
Inside the segments folder were segments from a previous crawl, and these were what was throwing the error. They did not contain all the segment data, as I believe that crawl was cancelled or finished early, and this caused the entire process to fail. Deleting all those files and starting anew fixed the issue.
QUESTION
I'm having difficulties in making a Tokio client that receives packets from a server and stores them in a queue for the main thread to process, while being able to send packets to the server from another queue at the same time.
I'm trying to make a very simple online game demonstration, having a game client that Sends data (it's own modified states, like player movement) and receives data (Game states modified by other players & server, like an NPC/other players that also moved).
The idea is to have a network thread that accesses two Arcs holding Mutexes over Vecs that store serialized data: one Arc is for IncomingPackets, and the other for OutgoingPackets. IncomingPackets would be filled with packets sent from the server to the client, to be read later by the main thread, and OutgoingPackets would be filled by the main thread with packets that should be sent to the server.
I can't seem to receive or send packets in another thread.
The client would only connect to the server, and the server would allow many clients (which would be served individually).
The explanations around streams' usage and implementation are not newbie-friendly, but I think I should be using them somehow.
I wrote some code, but it does not work and is probably wrong.
(My original code does not compile, so treat this as pseudocode, sorry)
...ANSWER
Answered 2020-Jul-20 at 13:55
Here's an example that's a bit contrived, but it should help:
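That (Rust) example is not reproduced above. Language aside, the two-shared-buffer design the question describes can be sketched in Python, with threads and queue.Queue standing in for Tokio tasks and the Arc&lt;Mutex&lt;Vec&gt;&gt; pair; the loopback "server" here is a stand-in for real socket I/O, not part of the original answer:

```python
import queue
import threading

# Two shared queues replace the two Arc<Mutex<Vec>> buffers:
incoming = queue.Queue()   # filled by the network thread, read by main
outgoing = queue.Queue()   # filled by main, drained by the network thread

def network_loop(send, recv):
    """Runs on its own thread: forwards outgoing packets to the server
    and stores whatever the server sends back in `incoming`.
    `send`/`recv` are stand-ins for the real socket I/O."""
    while True:
        pkt = outgoing.get()
        if pkt is None:          # shutdown sentinel
            break
        send(pkt)
        incoming.put(recv())

# A fake loopback "server" for demonstration: acknowledges each packet.
sent_to_server = []
def fake_send(pkt):
    sent_to_server.append(pkt)
def fake_recv():
    return b"ack:" + sent_to_server[-1]

t = threading.Thread(target=network_loop, args=(fake_send, fake_recv))
t.start()
outgoing.put(b"player_moved")    # main thread queues a state change
print(incoming.get())            # main thread processes the reply
# b'ack:player_moved'
outgoing.put(None)               # stop the network thread
t.join()
```

The same shape carries over to Tokio: the network task owns the socket, and the two shared buffers are the only thing the main loop and the network task touch in common.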
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported