storm-crawler | versatile web crawler based on Apache Storm
kandi X-RAY | storm-crawler Summary
A scalable, mature and versatile web crawler based on Apache Storm
Top functions reviewed by kandi - BETA
- Synchronized
- Store the given url to a tuple
- Store a fetch request
- Store a tuple to the cache
- Prepare URL partition
- Initialize metrics
- Initialize configuration
- Performs the actual processing
- Called after a bulk request is received
- Handle bulk request
- Configures the OkHttpClient
- Registers bulk request
- Called after a bulk request completes
- Configures the fetching process
- Process a single document
- Schedules the timestamp
- Populate the buffer
- Executes the given tuple
- Format the WARC version
- Runs the actual parsing
- Populate the query buffer
- Do the actual parsing
- Process a tuple
- Called when a tuple arrives
- Parse the content type
- Moves to the next WARC record
- Returns the HTTP response
- Execute the content
- Format the WARC record
Community Discussions
Trending Discussions on storm-crawler
QUESTION
I'm updating our crawler from storm-crawler 1.14 to 2.2. What is the replacement for the old ESSeedInjector?
...ANSWER
Answered 2022-Feb-15 at 10:05
The class-based topologies have been replaced by Flux files, which are far more flexible to use. The injection is now done as part of the crawl as you can see in es-crawler.flux. It would be easy to extract the injection part and put that in a separate script if you want to keep things separate. Alternatively, you could copy the code back from 1.14, put it in your project and fix whatever needs fixing for it to work with Storm 2.x.
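For illustration, a standalone injection Flux file extracted along those lines might look like the sketch below (class names as in the 2.x Elasticsearch module; the seed file location, parallelism and config file name are placeholders to adapt):

name: "injector"

includes:
  - resource: false
    file: "crawler-conf.yaml"
    override: true

spouts:
  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."          # directory containing the seed file
      - "seeds.txt"  # one URL per line
      - true         # emit the URLs on the status stream as DISCOVERED

bolts:
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1

streams:
  - from: "filespout"
    to: "status"
    grouping:
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byHost"
    streamId: "status"

Running it once with storm jar against the uber-jar (main class org.apache.storm.flux.Flux) populates the status index, after which the crawl topology runs on its own.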
QUESTION
I want to use StormCrawler with RDBMS engines like Oracle, MySQL, or Postgres. But in the storm-crawler-sql module we only have a SqlSpout and a StatusUpdaterBolt; we did not find any class for indexing crawl results into the SQL database. Is there a technical reason behind this?
...ANSWER
Answered 2021-May-27 at 13:29
What's wrong with the IndexerBolt?
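Indexing and status persistence are decoupled in StormCrawler, so the SQL module not shipping an indexer is not a blocker: any bolt can consume the parsed tuples. A minimal sketch of what such a bolt could look like, assuming Storm 2.x signatures, parser tuples carrying "url" and "text" fields, and a hypothetical documents table with url as a unique key (connection settings hard-coded only for brevity):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class SqlIndexerBolt extends BaseRichBolt {

    private transient Connection connection;
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            // in a real topology these settings would come from the storm config
            connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/crawl", "crawler", "secret");
        } catch (Exception e) {
            throw new RuntimeException("Cannot open SQL connection", e);
        }
    }

    @Override
    public void execute(Tuple tuple) {
        String url = tuple.getStringByField("url");
        String text = tuple.getStringByField("text");
        // upsert so a recrawl overwrites the previous version of the page
        String sql = "INSERT INTO documents (url, content) VALUES (?, ?)"
                + " ON DUPLICATE KEY UPDATE content = VALUES(content)";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setString(1, url);
            ps.setString(2, text);
            ps.executeUpdate();
            collector.ack(tuple);
        } catch (Exception e) {
            collector.fail(tuple);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: nothing to emit downstream
    }
}

A production version would pool connections, batch writes and emit status updates, but nothing in storm-crawler-sql prevents this pattern.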
QUESTION
I get the following errors when I run the es-crawler.flux topology. I'm not sure what I'm doing wrong; I don't think there are YAML errors.
...ANSWER
Answered 2021-Feb-23 at 22:09
I copied the Flux file from the Gist above and it ran without problems. Maybe the alignment of the lines is incorrect in your file (e.g. space missing)?
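YAML's whitespace sensitivity is the usual culprit with Flux files. A contrived illustration of the kind of misalignment that breaks parsing (the keys are only examples, not taken from the Gist):

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
   parallelism: 1    # one space short: no longer part of the same list item

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1   # aligned with its sibling keys, parses fine

Also note that tabs are not allowed for indentation in YAML, so checking the file with visible whitespace (e.g. cat -A) helps spot such differences.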
QUESTION
I am unable to figure out how to debug a Docker container (which is running StormCrawler) in VS Code. I tried https://code.visualstudio.com/docs/containers/debug-common and https://github.com/DigitalPebble/storm-crawler/wiki/Debug-with-Eclipse, but I did not find anything on how to configure the launch.json file for this.
Can anyone guide me on how to do this?
...ANSWER
Answered 2020-Oct-01 at 12:48
If you are trying to use the Docker Debugger provided by VSCode, I think you will run into weird issues. The documentation states
The Docker extension currently supports debugging Node.js, Python, and .NET Core applications within Docker containers.
In my experience, editing your Java code and Dockerfile, then rebuilding and rerunning the container helps me make edits and poke around my code for any issues.
Docker Hub may be a good place to search for help, too.
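An alternative that does work for Java is plain remote debugging over JDWP, which does not rely on the Docker extension at all. As a sketch (the port is an arbitrary choice, nothing StormCrawler-specific): start the worker JVM with -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 (address=*:5005 on Java 9+), for Storm e.g. via worker.childopts, publish port 5005 from the container, and attach with the Java extension using a launch.json like:

{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "java",
      "request": "attach",
      "name": "Attach to Storm worker in Docker",
      "hostName": "localhost",
      "port": 5005
    }
  ]
}

This uses the standard Java debugger in VS Code rather than the Docker debugger, so the limitation quoted above does not apply.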
QUESTION
I am using StormCrawler 1.16, Apache Storm 1.2.3, Maven 3.6.3 and JDK 1.8.
I have created the project using the archetype command below:
...ANSWER
Answered 2020-Jun-30 at 09:45
Can you please paste the content of ESCrawlTopology.java? Did you set com.cnf.245 as package name?
The template class gets rewritten during the execution of the archetype with the package name substituted, it could be that the value you set broke the template.
EDIT: you can't use a bare number as a package name segment in Java, since identifiers cannot start with a digit. See Using numbers as package names in Java.
Use a different package name and groupId.
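To make the constraint concrete: package segments are Java identifiers, and an identifier cannot start with a digit, so prefixing the numeric part with a letter is enough (com.cnf.c245 below is just an example replacement):

// package com.cnf.245;   // does not compile: "245" is not a valid identifier
package com.cnf.c245;     // compiles: the segment now starts with a letter

public class ESCrawlTopology {
    public static void main(String[] args) {
        System.out.println("valid package name");
    }
}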
QUESTION
I am crawling news websites using StormCrawler (v1.16) and storing the data in Elasticsearch (v7.5.0). My crawler-conf file follows the StormCrawler defaults, and I am using Kibana for visualization. My issues are:
- While crawling a news website I want only the URLs of article content, but I am also getting URLs of ads and other tabs on the site. What changes do I have to make, and where?
- If I want to extract only specific things from a page (like only the title, or only the content), how can I do that?
EDIT: I was thinking of adding a field to the content index, so I made changes in src/main/resources/parsefilter.json, ES_IndexInit.sh and crawler-conf.yaml. The XPath I have added is correct. I added
"parse.pubDate":"//META[@itemprop=\"datePublished\"]/@content"
in the parse filter,
parse.pubDate=PublishDate
in crawler-conf, and
"PublishDate": {
  "type": "text",
  "index": false,
  "store": true
}
in the properties of ES_IndexInit.sh. But I am still not getting any field named PublishDate in Kibana or Elasticsearch. The ES_IndexInit.sh mapping is as follows:
...ANSWER
Answered 2020-Jun-18 at 20:29
One approach to indexing only news pages from a site is to rely on sitemaps, but not all sites will provide these.
Alternatively, you'd need a mechanism as part of the parsing, maybe in a ParseFilter, to determine that a page is a news item and filter based on the presence of a key / value in the metadata during the indexing.
The way it is done in the news crawl dataset from CommonCrawl is that the seed URLs are sitemaps or RSS feeds.
To not index the content, simply comment out
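For the PublishDate part of the question, the usual wiring with the stock XPathFilter looks roughly like the sketch below, mirroring the names used in the question (the archetype ships the filter file as src/main/resources/parsefilters.json):

{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "parse.pubDate": "//META[@itemprop=\"datePublished\"]/@content"
      }
    }
  ]
}

with the metadata then mapped for indexing in crawler-conf.yaml through an indexer.md.mapping entry such as parse.pubDate=PublishDate. The field only appears once the uber-jar is rebuilt and the index recreated with the new mapping, so a missing PublishDate often comes down to skipping one of those two steps.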
QUESTION
I am using StormCrawler 1.16 with Elasticsearch 7.2.0. The Java version is 1.8.0_252, the Storm version is 1.2.3 and the Maven version is 3.6.3.
I have created the project using the mvn archetype command below:
...ANSWER
Answered 2020-Jun-15 at 08:57
You shouldn't need to run the ESInitScript again unless you want to delete the URLs that are in the status index. If you run it more than once, there will be nothing in status and this could be why the topology is idle.
There is no reason why having more URLs in the seeds file would cause a problem; we routinely have seed files with > 1M URLs and this is not an issue.
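A quick way to verify whether the status index still holds URLs before deciding to re-run the script (assuming Elasticsearch on its default port and the default index name used by the archetype):

curl -s "http://localhost:9200/status/_count?pretty"

A count of 0 after injection would explain a topology that sits idle.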
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported