nutch | Apache Nutch is an extensible and scalable web crawler
kandi X-RAY | nutch Summary
Top functions reviewed by kandi - BETA
- This method is used to reduce a set of values.
- Return a record writer.
- Dumps output file to output directory.
- Creates a graph.
- Retrieves the name and content attributes from the DOM.
- Helper method to process stats.
- Map a key to a crawldatum.
- Merge segments.
- Inject a crawl.
- Read the next gzip record.
Community Discussions
Trending Discussions on nutch
QUESTION
I'm trying to use the Apache Nutch 1.x REST API. I use Docker images to set up Nutch and Solr; you can see the demo repo here.
Apache Nutch uses Solr as its dependency. Solr works great; I'm able to reach its GUI at localhost:8983.
However, I cannot reach Apache Nutch's API at localhost:8081. The problem starts here. The Apache Nutch 1.X REST API doc indicates that I can start the server like this:
bin/nutch startserver -port <port_number>  [If the port option is not mentioned then by default the server starts on port 8081]
Which I am doing in the docker-compose.yml file. I'm also exposing the ports to the outside.
...ANSWER
Answered 2021-Jun-14 at 14:50
Nutch by default only replies to requests from localhost.
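A quick way to confirm this from the demo setup, plus a possible workaround, sketched below; the container name, the /admin status endpoint, and the -host option are assumptions to verify against your Nutch version (bin/nutch startserver prints its usage):
# The API should answer from inside the container, since it binds to localhost by default
# (the container name "nutch" is a placeholder for whatever docker-compose created):
docker exec nutch curl -s http://localhost:8081/admin
# If your Nutch build supports a host option, bind the server to all interfaces
# so the port published in docker-compose.yml is reachable from the host:
bin/nutch startserver -port 8081 -host 0.0.0.0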
QUESTION
If I set up my WCF project with an ApplicationInsights.config
file as outlined in this Microsoft documentation, data is logged to Application Insights as expected.
The config file looks like this:
...ANSWER
Answered 2021-Mar-11 at 02:25
The correct approach is to use the TelemetryConfiguration.CreateDefault() method to load any config from disk, then set or change additional values on the generated configuration. Once the TelemetryConfiguration instance is created, pass it to the constructor of TelemetryClient to create the client and start logging.
QUESTION
I am using Nutch 1.15 and Solr 7.3, and I followed the search highlighting guide: https://lucene.apache.org/solr/guide/7_3/highlighting.html
For me, a normal query for the Nutch Solr search is working and returning results:
curl http://localhost:8983/solr/nutch/select?q=content:build&wt=json&rows=10000&start=0
With the search highlight query I am getting the same results, but also a warning: hl.q=content:build: not found
The query with highlight params is like below:
curl http://localhost:8983/solr/nutch/select?q=content:build&hl=on&hl.q=content:build&wt=json&rows=10000&start=0
See the complete response -
...ANSWER
Answered 2021-Feb-08 at 19:10
You're not running the command you think you're running: & signals to the shell that the command should be run in the background, so what's effectively happening is that you're running multiple commands:
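A minimal illustration of the fix, using the same URL as above: quoting it keeps the shell from treating each & as a background operator, so the whole query string (including hl=on and hl.q) reaches Solr in a single request.
# Quoted URL: the ampersands are passed to curl instead of splitting the line
# into several backgrounded shell commands.
curl "http://localhost:8983/solr/nutch/select?q=content:build&hl=on&hl.q=content:build&wt=json&rows=10000&start=0"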
QUESTION
We have a server deployed on Amazon AWS. The problem we are facing is that whenever there's a special character in the URL, it redirects to a 403 Forbidden error. It works fine in my local environment but not on live. See below:
Does not work:
/checkout/cart/delete/id/243687/form_key/8182e1mPZIipGrXO/uenc/aHR0cHM6Ly93d3cuaG9iby5jb20ucGsvY2hlY2tvdXQvY2FydC8,
Works:
/checkout/cart/delete/id/243687/form_key/8182e1mPZIipGrXO/uenc/aHR0cHM6Ly93d3cuaG9iby5jb20ucGsvY2hlY2tvdXQvY2FydC8
Does not work:
/index.php/admin/catalog_product/new/attributes/OTI%253D/set/4/type/configurable/key/9f01c4b1a3f8c70002f3465b5899a54d
Works:
/index.php/admin/catalog_product/new/attributes/OTI253D/set/4/type/configurable/key/9f01c4b1a3f8c70002f3465b5899a54d
.htaccess for debugging
Given below is the .htaccess code, but the thing is that this code works in my local environment.
...ANSWER
Answered 2021-Jan-01 at 10:14
Try removing the query string 403 lines. It could work locally if you don't have mod_alias enabled, as those lines will be skipped.
QUESTION
I have a question about the configuration of Nutch and Solr. Do I have to name the _default directory in Solr "nutch", and do I have to mark the head of the schema.xml file as "nutch", or can I give it any name?
Thanks in advance
...ANSWER
Answered 2020-Nov-30 at 12:20
Nutch itself doesn't use the schema.xml file; it is provided as a base schema.xml to use in Solr (or as an example detailing which fields need to be added to your own schema). The name property of the schema.xml doesn't have to be nutch; it is provided just as an indication that the configuration is related to the operation of Nutch. Keep in mind that this file is only relevant to Solr's configuration.
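For illustration, a small sketch assuming the bundled schema.xml starts with a <schema name="nutch" ...> element (check your copy first; the version attribute may differ). The name is just a label, so renaming it is harmless:
# Inspect the schema element in the copy you deploy to Solr:
grep -m1 '<schema ' schema.xml     # e.g. <schema name="nutch" version="1.6">
# Rename it to whatever you prefer; Solr and Nutch behave the same either way:
sed -i 's/<schema name="nutch"/<schema name="mycrawl"/' schema.xml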
QUESTION
I want to index the source code of the web pages crawled by Apache Nutch (v1.17) into Solr (8.6.3), but I don't know how to do that. So far I only get a processed version indexed into the Solr content field (see below).
...ANSWER
Answered 2020-Nov-19 at 20:38
The Nutch index tool provides a command-line option to index the raw content of web pages:
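The actual command from the answer isn't reproduced here; the sketch below assumes the -addBinaryContent and -base64 options of the 1.x index tool and placeholder crawl paths (run bin/nutch index with no arguments to confirm the options in your version):
# Index a parsed segment and also store the raw page bytes (base64-encoded)
# alongside the extracted text; the Solr endpoint is taken from
# conf/index-writers.xml in Nutch 1.15+.
bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/20201119123456 \
  -addBinaryContent -base64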
QUESTION
I am using Nutch 1.17 to crawl over a million websites. I have to perform the following things for this.
- One time, run the crawler as a deep crawler so that it fetches the maximum number of URLs from the given (1 million) domains. For the first time, you can run it for a maximum of 48 hours.
- After this, run the crawler with the same 1 million domains after 5 to 6 hours and only select those URLs that are new on those domains.
- After the job completion, index the URLs in Solr.
- Later on, there is no need to store raw HTML; hence, to save storage (HDFS), remove the raw data only and maintain each page's metadata so that in the next job we avoid re-fetching a page again (before its scheduled time).
There isn't any other processing or post-analysis. Now, I have the choice to use a Hadoop cluster of medium size (max 30 machines). Each machine has 16 GB RAM, 12 cores and 2 TB storage. The Solr machine(s) have the same specs. Now, to maintain the above, I am curious about the following:
...ANSWER
Answered 2020-Sep-28 at 20:45
a. How to achieve the above document crawl rate, i.e., how many machines are enough?
Assuming a polite delay between successive fetches to the same domain is chosen, and that 10 pages can be fetched per domain per minute, the maximum crawl rate is 600 million pages per hour (10^6*10*60). A cluster with 360 cores should be enough to come close to this rate. Whether it's possible to crawl the one million domains exhaustively within 48 hours depends on the size of each of the domains. Keep in mind that at the mentioned crawl rate of 10 pages per domain per minute, it's only possible to fetch 10*60*48 = 28800 pages per domain within 48 hours.
c. Is it possible to remove raw data from Nutch and keep metadata only?
As soon as a segment has been indexed you can delete it. The CrawlDb is sufficient to decide whether a link found on one of the 1 million home pages is new.
- After the job completion, index URLs in Solr
Maybe index segments immediately after each cycle.
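A per-cycle sketch of that index-then-delete idea; the segment name and directory layout are placeholders, and the Solr endpoint is assumed to be configured in conf/index-writers.xml:
SEGMENT=crawl/segments/20200928103015   # placeholder: the segment produced by the last cycle
# Index the freshly parsed segment into Solr right after updatedb ...
bin/nutch index crawl/crawldb -linkdb crawl/linkdb "$SEGMENT"
# ... then drop the raw segment data to reclaim HDFS space; the CrawlDb alone
# keeps enough state (fetch status, schedule, signatures) to avoid premature re-fetching.
hadoop fs -rm -r "$SEGMENT"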
b. Should I add more machines or is there any better solution? d. Is there any best strategy to achieve the above objectives?
A lot depends on whether the domains are of similar size or not. In case they show a power-law distribution (that's likely), you have a few domains with multiple millions of pages (hardly crawled exhaustively) and a long tail of domains with only a few pages (at most a few hundred). In this situation you need fewer resources but more time to achieve the desired result.
QUESTION
I don't know if the guide is possibly outdated, or if I'm doing something wrong. I just started using Nutch, and I've integrated it with Solr and crawled/indexed some websites via the terminal. Now I'm trying to use them in a Java application, so I've been following the tutorial here: https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse#RunNutchInEclipse-RunningNutchinEclipse
I downloaded Subclipse, IvyDE and m2e through Eclipse, and I downloaded Ant, so I should have all the prerequisites. The m2e link in the tutorial is broken, so I found it somewhere else. It also turns out that Eclipse already had it upon installation.
I get a huge list of error messages when I run 'ant eclipse' in the terminal. Due to the word count limit, I've put a link to a pastebin with the entire error message here.
I'm really not sure what I'm doing wrong. The directions aren't that complicated, so I really don't know where I'm messing up.
Just in case it's necessary, here is the nutch-site.xml that we needed to modify.
...ANSWER
Answered 2020-Sep-24 at 04:15
As guided in the LOG file
QUESTION
I would like to crawl through a list of sites using Nutch, then break up each document into paragraphs and send them to Solr for indexing.
I have been using the following script to automate the process of crawling/fetching/parsing/indexing:
...ANSWER
Answered 2020-Sep-22 at 12:05
Currently, there is not a very easy answer to your question. To accomplish this you need custom code; specifically, Nutch has two different plugins to deal with parsing HTML code, parse-html and parse-tika. These plugins are focused on extracting text content and not so much structured data out of the HTML document.
You would need to have a custom parser plugin (HtmlParserPlugin) that will treat paragraph nodes within your HTML document in a custom way (extracting the content and positional information).
The other component that you would need is for modeling the data in Solr; since you need to keep the position of the paragraph within the same document, you also need to send this data in a way that is searchable in Solr, perhaps using nested documents (this really depends on how you plan to use the data).
For instance, you may take a look at this plugin, which implements custom logic for extracting data using arbitrary XPath expressions from the HTML.
QUESTION
I am trying to crawl using Nutch 1.17 but the URL is being rejected. There is a #! in the URL, for example: xxmydomain.com/xxx/#!/xxx/abc.html
I have also tried to include
+^/
+^#!
in my regex-urlfilter.
...ANSWER
Answered 2020-Sep-21 at 14:33
If you look in the regex-normalize.xml file: this rule file is applied as part of the urlnormalizer-regex plugin, which is included by default in plugin.includes in nutch-site.xml. As part of URL normalization, one of its rules truncates URLs, removing anything present after the URL fragment.
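To see the effect, the checker tools shipped with bin/nutch can show how such a URL is normalized and whether it then passes the filters; exact flags vary across 1.x releases, so treat this as a sketch and check each tool's usage output:
# How the configured normalizers rewrite a #! URL (urlnormalizer-regex usually
# strips everything after the fragment, per conf/regex-normalize.xml):
echo "http://xxmydomain.com/xxx/#!/xxx/abc.html" | bin/nutch normalizerchecker -stdin
# Whether the (normalized) URL passes regex-urlfilter.txt:
echo "http://xxmydomain.com/xxx/#!/xxx/abc.html" | bin/nutch filterchecker -stdin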
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported