storm-crawler | versatile web crawler based on Apache Storm

by DigitalPebble | Version: 2.10 | License: Apache-2.0

kandi X-RAY | storm-crawler Summary

storm-crawler is an HTML library typically used in Big Data applications. storm-crawler has no known bugs and no reported vulnerabilities, it has a permissive license, and it has medium support. You can download it from GitHub.

A scalable, mature and versatile web crawler based on Apache Storm

Support

storm-crawler has a moderately active ecosystem.
It has 803 stars, 246 forks, and 71 watchers.
There were 2 major releases in the last 12 months.
There are 41 open issues and 653 closed issues. On average, issues are closed in 11 days. There are no open pull requests.
It has a neutral sentiment in the developer community.
The latest version of storm-crawler is 2.10.

Quality

              storm-crawler has 0 bugs and 0 code smells.

Security

              storm-crawler has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              storm-crawler code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

License

              storm-crawler is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

              storm-crawler releases are available to install and integrate.
              It has 21381 lines of code, 1076 functions and 233 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

kandi has reviewed storm-crawler and identified the functions below as its top functions. This is intended to give you an instant insight into the functionality storm-crawler implements and to help you decide whether it suits your requirements.
            • Synchronized
            • Store the given url to a tuple
            • Store a fetch request
            • Store a tuple to the cache
            • Prepare URL partition
            • Initialize metrics
            • Initialize configuration
            • Performs the actual processing
            • Called after a bulk request is received
            • Handle bulk request
            • Configures the OkHttpClient
            • Registers bulk request
            • Called after a bulk request completes
            • Configures the fetching process
            • Process a single document
            • Schedules the timestamp
            • Populate the buffer
            • Executes the given tuple
            • Format the WARC version
            • Runs the actual parsing
            • Populate the query buffer
            • Do the actual parsing
            • Process a tuple
            • Called when a tuple arrives
            • Parse the content type
            • Moves to the next WARC record
            • Returns the HTTP response
            • Execute the content
            • Format the WARC record

            storm-crawler Key Features

            No Key Features are available at this moment for storm-crawler.

            storm-crawler Examples and Code Snippets

            No Code Snippets are available at this moment for storm-crawler.

            Community Discussions

            QUESTION

            Replacement of ESSeedInjector in storm-crawler 2.2
            Asked 2022-Feb-15 at 10:05

            I'm updating our crawler from storm-crawler 1.14 to 2.2. What is the replacement for the old ESSeedInjector?

            ...

            ANSWER

            Answered 2022-Feb-15 at 10:05

            The class-based topologies have been replaced by Flux files, which are far more flexible to use. The injection is now done as part of the crawl as you can see in es-crawler.flux. It would be easy to extract the injection part and put that in a separate script if you want to keep things separate. Alternatively, you could copy the code back from 1.14, put it in your project and fix whatever needs fixing for it to work with Storm 2.x.
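For illustration, an injection-only topology can be expressed as its own Flux file. The sketch below is an assumption based on the es-injector.flux / es-crawler.flux files shipped with the Elasticsearch archetype; class names and constructor arguments should be checked against the exact version in use.

name: "injector"

includes:
  # default StormCrawler settings plus project-specific overrides
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-conf.yaml"
    override: true
  - resource: false
    file: "es-conf.yaml"
    override: true

spouts:
  # reads seed URLs from seeds.txt in the current directory
  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds.txt"
      - true

bolts:
  # writes the URLs and their status into the Elasticsearch status index
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1

streams:
  # route every URL to the status updater, partitioned by host
  - from: "filespout"
    to: "status"
    grouping:
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byHost"
    streamId: "status"

Such a file can be launched the same way as es-crawler.flux and dropped once injection no longer needs to run separately.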

            Source https://stackoverflow.com/questions/71122674

            QUESTION

Why is there no bolt for storing crawl results in StormCrawler when using an RDBMS?
            Asked 2021-May-27 at 13:29

I want to use StormCrawler with an RDBMS engine such as Oracle, MySQL, or Postgres. However, the storm-crawler-sql module only provides a SqlSpout and a StatusUpdaterBolt; we did not find any class for indexing crawl results into the SQL database. Is there a technical reason behind this?

            ...

            ANSWER

            Answered 2021-May-27 at 13:29

            QUESTION

StormCrawler / Elasticsearch / Apache Tika for parsing PDFs: getting an error when running the topology
            Asked 2021-Feb-23 at 22:09

I get the following errors when I run the es-crawler.flux topology. I'm not sure what I'm doing wrong; I don't think there are any YAML errors.

            ...

            ANSWER

            Answered 2021-Feb-23 at 22:09

            I copied the Flux file from the Gist above and it ran without problems. Maybe the alignment of the lines is incorrect in your file (e.g. space missing)?

            Source https://stackoverflow.com/questions/66340008

            QUESTION

How can I debug a Docker container (running StormCrawler, written in Java) in VS Code?
            Asked 2020-Oct-04 at 18:31

I cannot work out how to debug the Docker container (which is running StormCrawler) in VS Code. I looked at https://code.visualstudio.com/docs/containers/debug-common and https://github.com/DigitalPebble/storm-crawler/wiki/Debug-with-Eclipse, but I did not find anything explaining how to configure the launch.json file for this.

Can anyone guide me on how to do this?

            ...

            ANSWER

            Answered 2020-Oct-01 at 12:48

            If you are trying to use the Docker Debugger provided by VSCode, I think you will run into weird issues. The documentation states

            The Docker extension currently supports debugging Node.js, Python, and .NET Core applications within Docker containers.

            In my experience, editing your Java code and Dockerfile, then rebuilding and rerunning the container helps me make edits and poke around my code for any issues.

Docker Hub may be a good place to search for help, too.
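One alternative the answer does not cover, sketched here under assumptions (the VS Code "Debugger for Java" extension is installed, and port 5005 is an arbitrary free port), is plain JDWP remote debugging: start the JVM in the container with a debug agent, publish the port, and attach from launch.json.

# JVM flag for the process running the topology inside the container
# (for Storm workers this can be added via the worker.childopts setting);
# on Java 8 use address=5005 instead of address=*:5005
-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005

# publish the debug port when starting the container
docker run -p 5005:5005 ...

// launch.json: attach configuration for the Java debugger
{
    "type": "java",
    "name": "Attach to StormCrawler container",
    "request": "attach",
    "hostName": "localhost",
    "port": 5005
}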

            Source https://stackoverflow.com/questions/64155656

            QUESTION

Build Failure in StormCrawler 1.16
            Asked 2020-Jun-30 at 09:45

I am using StormCrawler 1.16, Apache Storm 1.2.3, Maven 3.6.3, and JDK 1.8.

I have created the project using the artifact command below:

            ...

            ANSWER

            Answered 2020-Jun-30 at 09:45

Can you please paste the content of ESCrawlTopology.java? Did you set com.cnf.245 as the package name?

The template class gets rewritten during the execution of the archetype, with the package name substituted; it could be that the value you set broke the template.

EDIT: you can't use a bare number as a package name segment in Java. See "Using numbers as package names in Java".

Use a different package name and group ID.
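For example (hypothetical names), the problem is that a package segment cannot start with a digit:

// invalid: the segment "245" starts with a digit
package com.cnf.245;

// valid: digits are allowed as long as the segment does not start with one
package com.cnf.crawler245;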

            Source https://stackoverflow.com/questions/62646455

            QUESTION

How to crawl specific data from a website using StormCrawler
            Asked 2020-Jun-19 at 10:00

I am crawling news websites using StormCrawler (v1.16) and storing the data in Elasticsearch (v7.5.0). My crawler-conf file follows the standard StormCrawler files, and I am using Kibana for visualization. My issues are:

• While crawling a news website I only want the URLs of article content, but I am also getting URLs of ads and other tabs on the site. What changes do I have to make, and where? (Kibana link)
• If I want to extract only specific things from a URL (like only the title or only the content), how can I do that?

EDIT: I wanted to add a field to the content index, so I made changes to src/main/resources/parsefilter.json, ES_IndexInit.sh, and crawler-conf.yaml. The XPath I added is correct. I added

"parse.pubDate": "//META[@itemprop=\"datePublished\"]/@content"

in the parse filter,

parse.pubDate=PublishDate

in crawler-conf, and

"PublishDate": { "type": "text", "index": false, "store": true }

in the properties of ES_IndexInit.sh. But I am still not getting any field named PublishDate in Kibana or Elasticsearch. The ES_IndexInit.sh mapping is as follows:

            ...
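For reference, here is a sketch of how a parse-time metadata field is usually exposed to the indexing bolt via indexer.md.mapping in crawler-conf.yaml (this is an assumption, not taken from the question; the PublishDate name mirrors the field used above):

# crawler-conf.yaml: metadata keys produced at parse time that should be
# copied into the indexed document, as source=destination pairs
indexer.md.mapping:
  - parse.title=title
  - parse.description=description
  - parse.pubDate=PublishDate

Note that changes to the mappings in ES_IndexInit.sh only apply to indices created after the script is re-run, so the content index generally has to be recreated before the new field can appear.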

            ANSWER

            Answered 2020-Jun-18 at 20:29

            One approach to indexing only news pages from a site is to rely on sitemaps, but not all sites will provide these.

Alternatively, you'd need a mechanism as part of the parsing, maybe in a ParseFilter, to determine that a page is a news item, and then filter on the presence of that key/value in the metadata during indexing.
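As a sketch of that filtering step (the metadata key and value are hypothetical and would be set by your own ParseFilter), the indexing bolt can be restricted through crawler-conf.yaml:

# only index documents whose metadata contains the pair isNews=true
indexer.md.filter: "isNews=true"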

            The way it is done in the news crawl dataset from CommonCrawl is that the seed URLs are sitemaps or RSS feeds.

            To not index the content, simply comment out

            Source https://stackoverflow.com/questions/62456731

            QUESTION

            Exception with ES query
            Asked 2020-Jun-15 at 08:57

I am using StormCrawler 1.16 with Elasticsearch 7.2.0. The Java version is 1.8.0_252, the Storm version is 1.2.3, and the Maven version is 3.6.3.

I have created the project using the Maven archetype:

            ...

            ANSWER

            Answered 2020-Jun-15 at 08:57

You shouldn't need to run the ES init script again unless you want to delete the URLs that are in the status index. If you run it more than once, there will be nothing in the status index, and this could be why the topology is idle.

There is no reason why having more URLs in the seeds file would cause a problem; we routinely use seed files with more than 1M URLs.

            Source https://stackoverflow.com/questions/62346814

Community Discussions and Code Snippets include sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install storm-crawler

            You can download it from GitHub.
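If you build a topology with Maven, the core module can also be declared as a dependency. The coordinates below are an assumption based on the com.digitalpebble.stormcrawler group used by the 2.x releases and should be verified against the version you target:

<!-- core StormCrawler spouts, bolts, protocol implementations and filters -->
<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-core</artifactId>
    <version>2.10</version>
</dependency>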

            Support

For any new features, suggestions, or bugs, create an issue on GitHub. If you have questions, check the community page and ask on Stack Overflow.

Clone

• HTTPS

  https://github.com/DigitalPebble/storm-crawler.git

• GitHub CLI

  gh repo clone DigitalPebble/storm-crawler

• SSH

  git@github.com:DigitalPebble/storm-crawler.git
