nutch | Apache Nutch is an extensible and scalable web crawler
kandi X-RAY | nutch Summary
Top functions reviewed by kandi - BETA
- This method is used to reduce a set of values.
- Return a record writer.
- Dumps output file to output directory.
- Creates a graph.
- Retrieves the name and content attributes from the DOM.
- Helper method to process stats.
- Map a key to a crawldatum.
- Merge segments.
- Inject a crawl.
- Read the next gzip record.
Community Discussions
Trending Discussions on nutch
QUESTION
I'm trying to use the Apache Nutch 1.x REST API. I use Docker images to set up Nutch and Solr; you can see the demo repo here.
Apache Nutch uses Solr as its dependency. Solr works great; I'm able to reach its GUI at localhost:8983.
However, I cannot reach Apache Nutch's API at localhost:8081. The problem starts here. The Apache Nutch 1.X REST API doc indicates that I can start the server like this:
bin/nutch startserver -port <port_number>  [If the port option is not mentioned then by default the server starts on port 8081]
Which I am doing in the docker-compose.yml file. I'm also exposing the ports to the outside.
...ANSWER
Answered 2021-Jun-14 at 14:50
Nutch by default only replies to requests from localhost.
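A quick way to confirm this from the demo setup, plus a possible workaround, sketched below; the container name, the /admin status endpoint, and the -host option are assumptions to verify against your Nutch version (bin/nutch startserver prints its usage):
# The API should answer from inside the container, since it binds to localhost by default
# (the container name "nutch" is a placeholder for whatever docker-compose created):
docker exec nutch curl -s http://localhost:8081/admin
# If your Nutch build supports a host option, bind the server to all interfaces
# so the port published in docker-compose.yml is reachable from the host:
bin/nutch startserver -port 8081 -host 0.0.0.0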
QUESTION
If I set up my WCF project with an ApplicationInsights.config
file as outlined in this Microsoft documentation, data is logged to Application Insights as expected.
The config file looks like this:
...ANSWER
Answered 2021-Mar-11 at 02:25
The correct approach is to use the TelemetryConfiguration.CreateDefault() method to load any config from disk, then set or change additional values on the generated configuration. Once the TelemetryConfiguration instance is created, pass it to the constructor of TelemetryClient to create the client and start logging.
QUESTION
I am using Nutch 1.15 and Solr 7.3, and I followed the search highlighting guide: https://lucene.apache.org/solr/guide/7_3/highlighting.html
For me, a normal query for the Nutch Solr search is working and returning results:
curl http://localhost:8983/solr/nutch/select?q=content:build&wt=json&rows=10000&start=0
With the search highlight query I am getting the same results, but also a warning: hl.q=content:build: not found
The query with highlight params is like below:
curl http://localhost:8983/solr/nutch/select?q=content:build&hl=on&hl.q=content:build&wt=json&rows=10000&start=0
See the complete response -
...ANSWER
Answered 2021-Feb-08 at 19:10
You're not running the command you think you're running: & signals to the shell that the command should be run in the background, so what's effectively happening is that you're running multiple commands:
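A minimal illustration of the fix, using the same URL as above: quoting it keeps the shell from treating each & as a background operator, so the whole query string (including hl=on and hl.q) reaches Solr in a single request.
# Quoted URL: the ampersands are passed to curl instead of splitting the line
# into several backgrounded shell commands.
curl "http://localhost:8983/solr/nutch/select?q=content:build&hl=on&hl.q=content:build&wt=json&rows=10000&start=0"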
QUESTION
We have a server deployed on Amazon AWS. The problem we are facing is that whenever there's a special character in the URL, it redirects to a 403 Forbidden error. It works fine in my local environment but not on live. See below:
Does not work:
/checkout/cart/delete/id/243687/form_key/8182e1mPZIipGrXO/uenc/aHR0cHM6Ly93d3cuaG9iby5jb20ucGsvY2hlY2tvdXQvY2FydC8,
Works:
/checkout/cart/delete/id/243687/form_key/8182e1mPZIipGrXO/uenc/aHR0cHM6Ly93d3cuaG9iby5jb20ucGsvY2hlY2tvdXQvY2FydC8
Does not work:
/index.php/admin/catalog_product/new/attributes/OTI%253D/set/4/type/configurable/key/9f01c4b1a3f8c70002f3465b5899a54d
Works:
/index.php/admin/catalog_product/new/attributes/OTI253D/set/4/type/configurable/key/9f01c4b1a3f8c70002f3465b5899a54d
.htaccess for debugging
Given below is the .htaccess code, but the thing is that this code works in my local environment.
...ANSWER
Answered 2021-Jan-01 at 10:14
Try removing the query string 403 lines. It could work locally if you don't have mod_alias enabled, as those lines will be skipped.
QUESTION
I have a question about the configuration of Nutch and Solr. Do I have to name the _default directory in Solr "nutch", and do I have to mark the head of the schema.xml file as "nutch", or can I give it any name?
Thanks in advance
...ANSWER
Answered 2020-Nov-30 at 12:20
Nutch itself doesn't use the schema.xml file; it is provided as a base schema.xml to use in Solr (or as an example detailing which fields need to be added to your own schema). The name property of the schema.xml doesn't have to be nutch; it is provided just as an indication that the configuration is related to the operation of Nutch. Keep in mind that this file is only relevant to Solr's configuration.
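For illustration, a small sketch assuming the bundled schema.xml starts with a <schema name="nutch" ...> element (check your copy first; the version attribute may differ). The name is just a label, so renaming it is harmless:
# Inspect the schema element in the copy you deploy to Solr:
grep -m1 '<schema ' schema.xml     # e.g. <schema name="nutch" version="1.6">
# Rename it to whatever you prefer; Solr and Nutch behave the same either way:
sed -i 's/<schema name="nutch"/<schema name="mycrawl"/' schema.xml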
QUESTION
I want to index the source code of the web pages crawled by Apache Nutch (v1.17) into Solr (8.6.3), but I don't know how to do that. So far I only get a processed version indexed into the Solr content field (see below).
...ANSWER
Answered 2020-Nov-19 at 20:38
The Nutch index tool provides a command-line option to index the raw content of web pages:
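The actual command from the answer isn't reproduced here; the sketch below assumes the -addBinaryContent and -base64 options of the 1.x index tool and placeholder crawl paths (run bin/nutch index with no arguments to confirm the options in your version):
# Index a parsed segment and also store the raw page bytes (base64-encoded)
# alongside the extracted text; the Solr endpoint is taken from
# conf/index-writers.xml in Nutch 1.15+.
bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/20201119123456 \
  -addBinaryContent -base64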
QUESTION
I am using Nutch 1.17 to crawl over a million websites. I have to perform the following things for this.
- One time, run the crawler as a deep crawler so that it fetches the maximum number of URLs from the given (1 million) domains. For the first time, you can run it for a maximum of 48 hours.
- After this, run the crawler with the same 1 million domains after 5 to 6 hours and only select those URLs that are new on those domains.
- After the job completion, index the URLs in Solr.
- Later on, there is no need to store raw HTML; hence, to save storage (HDFS), remove the raw data only and maintain each page's metadata so that in the next job we avoid re-fetching a page again (before its scheduled time).
There isn't any other processing or post-analysis. Now, I have the choice to use a Hadoop cluster of medium size (max 30 machines). Each machine has 16 GB RAM, 12 cores and 2 TB storage. The Solr machine(s) have the same specs. Now, to maintain the above, I am curious about the following:
...ANSWER
Answered 2020-Sep-28 at 20:45
a. How to achieve the above document crawl rate, i.e., how many machines are enough?
Assuming a polite delay between successive fetches to the same domain is chosen, and that 10 pages can be fetched per domain per minute, the maximum crawl rate is 600 million pages per hour (10^6*10*60). A cluster with 360 cores should be enough to come close to this rate. Whether it's possible to crawl the one million domains exhaustively within 48 hours depends on the size of each of the domains. Keep in mind that at the mentioned crawl rate of 10 pages per domain per minute, it's only possible to fetch 10*60*48 = 28800 pages per domain within 48 hours.
c. Is it possible to remove raw data from Nutch and keep metadata only?
As soon as a segment has been indexed you can delete it. The CrawlDb is sufficient to decide whether a link found on one of the 1 million home pages is new.
- After the job completion, index URLs in Solr
Maybe index segments immediately after each cycle.
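A per-cycle sketch of that index-then-delete idea; the segment name and directory layout are placeholders, and the Solr endpoint is assumed to be configured in conf/index-writers.xml:
SEGMENT=crawl/segments/20200928103015   # placeholder: the segment produced by the last cycle
# Index the freshly parsed segment into Solr right after updatedb ...
bin/nutch index crawl/crawldb -linkdb crawl/linkdb "$SEGMENT"
# ... then drop the raw segment data to reclaim HDFS space; the CrawlDb alone
# keeps enough state (fetch status, schedule, signatures) to avoid premature re-fetching.
hadoop fs -rm -r "$SEGMENT"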
b. Should I add more machines or is there any better solution? d. Is there any best strategy to achieve the above objectives?
A lot depends on whether the domains are of similar size or not. In case they show a power-law distribution (that's likely), you have a few domains with multiple millions of pages (hardly crawled exhaustively) and a long tail of domains with only a few pages (at most a few hundred). In this situation you need fewer resources but more time to achieve the desired result.
QUESTION
I don't know if the guide is possibly outdated, or if I'm doing something wrong. I just started using Nutch, and I've integrated it with Solr and crawled/indexed some websites via the terminal. Now I'm trying to use them in a Java application, so I've been following the tutorial here: https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse#RunNutchInEclipse-RunningNutchinEclipse
I downloaded Subclipse, IvyDE and m2e through Eclipse, and I downloaded Ant, so I should have all the prerequisites. The m2e link in the tutorial is broken, so I found it somewhere else. It also turns out that Eclipse already had it upon installation.
I get a huge list of error messages when I run 'ant eclipse' in the terminal. Due to the word count limit, I've put a link to a pastebin with the entire error message here.
I'm really not sure what I'm doing wrong. The directions aren't that complicated, so I really don't know where I'm messing up.
Just in case it's necessary, here is the nutch-site.xml that we needed to modify.
...ANSWER
Answered 2020-Sep-24 at 04:15
As guided in the LOG file
QUESTION
I would like to crawl through a list of sites using Nutch, then break up each document into paragraphs and send them to Solr for indexing.
I have been using the following script to automate the process of crawling/fetching/parsing/indexing:
...ANSWER
Answered 2020-Sep-22 at 12:05
Currently, there is not a very easy answer to your question. To accomplish this you need custom code; specifically, Nutch has two different plugins to deal with parsing HTML code, parse-html and parse-tika. These plugins are focused on extracting text content and not so much structured data out of the HTML document.
You would need to have a custom parser plugin (HtmlParserPlugin) that will treat paragraph nodes within your HTML document in a custom way (extracting the content and positional information).
The other component that you would need is for modeling the data in Solr; since you need to keep the position of the paragraph within the same document, you also need to send this data in a way that is searchable in Solr, perhaps using nested documents (this really depends on how you plan to use the data).
For instance, you may take a look at this plugin, which implements custom logic for extracting data using arbitrary XPath expressions from the HTML.
QUESTION
I am trying to crawl using Nutch 1.17 but the URL is being rejected. There is a #! in the URL, for example: xxmydomain.com/xxx/#!/xxx/abc.html
I have also tried to include
+^/
+^#!
in my regex-urlfilter.
...ANSWER
Answered 2020-Sep-21 at 14:33
If you look in the regex-normalize.xml file: this rule file is applied as part of the urlnormalizer-regex plugin, which is included by default in plugin.includes in nutch-site.xml. As part of URL normalization, one of its rules truncates URLs, removing anything present after the URL fragment.
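To see the effect, the checker tools shipped with bin/nutch can show how such a URL is normalized and whether it then passes the filters; exact flags vary across 1.x releases, so treat this as a sketch and check each tool's usage output:
# How the configured normalizers rewrite a #! URL (urlnormalizer-regex usually
# strips everything after the fragment, per conf/regex-normalize.xml):
echo "http://xxmydomain.com/xxx/#!/xxx/abc.html" | bin/nutch normalizerchecker -stdin
# Whether the (normalized) URL passes regex-urlfilter.txt:
echo "http://xxmydomain.com/xxx/#!/xxx/abc.html" | bin/nutch filterchecker -stdin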
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported