storm-crawler | versatile web crawler based on Apache Storm

by DigitalPebble | Version: 2.10 | License: Apache-2.0

kandi X-RAY | storm-crawler Summary

storm-crawler is an HTML library typically used in Big Data applications. storm-crawler has no known bugs and no reported vulnerabilities, it has a permissive license, and it has medium support. You can download it from GitHub.

A scalable, mature and versatile web crawler based on Apache Storm

Support

storm-crawler has a moderately active ecosystem.
It has 803 stars, 246 forks, and 71 watchers.
There were 2 major releases in the last 12 months.
There are 41 open issues and 653 closed issues. On average, issues are closed in 11 days. There are no open pull requests.
It has a neutral sentiment in the developer community.
The latest version of storm-crawler is 2.10.

Quality

              storm-crawler has 0 bugs and 0 code smells.

Security

              storm-crawler has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              storm-crawler code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

License

              storm-crawler is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

              storm-crawler releases are available to install and integrate.
              It has 21381 lines of code, 1076 functions and 233 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

kandi has reviewed storm-crawler and identified the functions below as its top functions. This is intended to give you an instant insight into the functionality storm-crawler implements and to help you decide whether it suits your requirements.
            • Synchronized
            • Store the given url to a tuple
            • Store a fetch request
            • Store a tuple to the cache
            • Prepare URL partition
            • Initialize metrics
            • Initialize configuration
            • Performs the actual processing
            • Called after a bulk request is received
            • Handle bulk request
            • Configures the OkHttpClient
            • Registers bulk request
            • Called after a bulk request completes
            • Configures the fetching process
            • Process a single document
            • Schedules the timestamp
            • Populate the buffer
            • Executes the given tuple
            • Format the WARC version
            • Runs the actual parsing
            • Populate the query buffer
            • Do the actual parsing
            • Process a tuple
            • Called when a tuple arrives
            • Parse the content type
            • Moves to the next WARC record
            • Returns the HTTP response
            • Execute the content
            • Format the WARC record

            storm-crawler Key Features

            No Key Features are available at this moment for storm-crawler.

            storm-crawler Examples and Code Snippets

            No Code Snippets are available at this moment for storm-crawler.

            Community Discussions

            QUESTION

            Replacement of ESSeedInjector in storm-crawler 2.2
            Asked 2022-Feb-15 at 10:05

            I'm updating our crawler from storm-crawler 1.14 to 2.2. What is the replacement for the old ESSeedInjector?

            ...

            ANSWER

            Answered 2022-Feb-15 at 10:05

            The class-based topologies have been replaced by Flux files, which are far more flexible to use. The injection is now done as part of the crawl as you can see in es-crawler.flux. It would be easy to extract the injection part and put that in a separate script if you want to keep things separate. Alternatively, you could copy the code back from 1.14, put it in your project and fix whatever needs fixing for it to work with Storm 2.x.
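For illustration, an injection-only topology can be expressed as its own Flux file. The sketch below is an assumption based on the es-injector.flux / es-crawler.flux files shipped with the Elasticsearch archetype; class names and constructor arguments should be checked against the exact version in use.

name: "injector"

includes:
  # default StormCrawler settings plus project-specific overrides
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-conf.yaml"
    override: true
  - resource: false
    file: "es-conf.yaml"
    override: true

spouts:
  # reads seed URLs from seeds.txt in the current directory
  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds.txt"
      - true

bolts:
  # writes the URLs and their status into the Elasticsearch status index
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1

streams:
  # route every URL to the status updater, partitioned by host
  - from: "filespout"
    to: "status"
    grouping:
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byHost"
    streamId: "status"

Such a file can be launched the same way as es-crawler.flux and dropped once injection no longer needs to run separately.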

            Source https://stackoverflow.com/questions/71122674

            QUESTION

Why is there no bolt for storing crawl results in StormCrawler when using an RDBMS?
            Asked 2021-May-27 at 13:29

I want to use StormCrawler with an RDBMS engine such as Oracle, MySQL, or Postgres. However, the storm-crawler-sql module only provides a SqlSpout and a StatusUpdaterBolt; we did not find any class for indexing crawl results into the SQL database. Is there a technical reason behind this?

            ...

            ANSWER

            Answered 2021-May-27 at 13:29

            QUESTION

StormCrawler / Elasticsearch / Apache Tika for parsing PDFs: getting an error when running the topology
            Asked 2021-Feb-23 at 22:09

I get the following errors when I run the es-crawler.flux topology. I'm not sure what I'm doing wrong; I don't think there are any YAML errors.

            ...

            ANSWER

            Answered 2021-Feb-23 at 22:09

            I copied the Flux file from the Gist above and it ran without problems. Maybe the alignment of the lines is incorrect in your file (e.g. space missing)?

            Source https://stackoverflow.com/questions/66340008

            QUESTION

How can I debug a Docker container (running StormCrawler, written in Java) in VS Code?
            Asked 2020-Oct-04 at 18:31

I cannot work out how to debug the Docker container (which is running StormCrawler) in VS Code. I looked at https://code.visualstudio.com/docs/containers/debug-common and https://github.com/DigitalPebble/storm-crawler/wiki/Debug-with-Eclipse, but I did not find anything explaining how to configure the launch.json file for this.

Can anyone guide me on how to do this?

            ...

            ANSWER

            Answered 2020-Oct-01 at 12:48

            If you are trying to use the Docker Debugger provided by VSCode, I think you will run into weird issues. The documentation states

            The Docker extension currently supports debugging Node.js, Python, and .NET Core applications within Docker containers.

            In my experience, editing your Java code and Dockerfile, then rebuilding and rerunning the container helps me make edits and poke around my code for any issues.

Docker Hub may be a good place to search for help, too.
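One alternative the answer does not cover, sketched here under assumptions (the VS Code "Debugger for Java" extension is installed, and port 5005 is an arbitrary free port), is plain JDWP remote debugging: start the JVM in the container with a debug agent, publish the port, and attach from launch.json.

# JVM flag for the process running the topology inside the container
# (for Storm workers this can be added via the worker.childopts setting);
# on Java 8 use address=5005 instead of address=*:5005
-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005

# publish the debug port when starting the container
docker run -p 5005:5005 ...

// launch.json: attach configuration for the Java debugger
{
    "type": "java",
    "name": "Attach to StormCrawler container",
    "request": "attach",
    "hostName": "localhost",
    "port": 5005
}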

            Source https://stackoverflow.com/questions/64155656

            QUESTION

Build Failure in StormCrawler 1.16
            Asked 2020-Jun-30 at 09:45

I am using StormCrawler 1.16, Apache Storm 1.2.3, Maven 3.6.3, and JDK 1.8.

I have created the project using the artifact command below:

            ...

            ANSWER

            Answered 2020-Jun-30 at 09:45

Can you please paste the content of ESCrawlTopology.java? Did you set com.cnf.245 as the package name?

The template class gets rewritten during the execution of the archetype, with the package name substituted; it could be that the value you set broke the template.

EDIT: you can't use a bare number as a package name segment in Java. See "Using numbers as package names in Java".

Use a different package name and group ID.
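For example (hypothetical names), the problem is that a package segment cannot start with a digit:

// invalid: the segment "245" starts with a digit
package com.cnf.245;

// valid: digits are allowed as long as the segment does not start with one
package com.cnf.crawler245;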

            Source https://stackoverflow.com/questions/62646455

            QUESTION

How to crawl specific data from a website using StormCrawler
            Asked 2020-Jun-19 at 10:00

I am crawling news websites using StormCrawler (v1.16) and storing the data in Elasticsearch (v7.5.0). My crawler-conf file follows the standard StormCrawler files, and I am using Kibana for visualization. My issues are:

• While crawling a news website I only want the URLs of article content, but I am also getting URLs of ads and other tabs on the site. What changes do I have to make, and where? (Kibana link)
• If I want to extract only specific things from a URL (like only the title or only the content), how can I do that?

EDIT: I wanted to add a field to the content index, so I made changes to src/main/resources/parsefilter.json, ES_IndexInit.sh, and crawler-conf.yaml. The XPath I added is correct. I added

"parse.pubDate": "//META[@itemprop=\"datePublished\"]/@content"

in the parse filter,

parse.pubDate=PublishDate

in crawler-conf, and

"PublishDate": { "type": "text", "index": false, "store": true }

in the properties of ES_IndexInit.sh. But I am still not getting any field named PublishDate in Kibana or Elasticsearch. The ES_IndexInit.sh mapping is as follows:

            ...
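For reference, here is a sketch of how a parse-time metadata field is usually exposed to the indexing bolt via indexer.md.mapping in crawler-conf.yaml (this is an assumption, not taken from the question; the PublishDate name mirrors the field used above):

# crawler-conf.yaml: metadata keys produced at parse time that should be
# copied into the indexed document, as source=destination pairs
indexer.md.mapping:
  - parse.title=title
  - parse.description=description
  - parse.pubDate=PublishDate

Note that changes to the mappings in ES_IndexInit.sh only apply to indices created after the script is re-run, so the content index generally has to be recreated before the new field can appear.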

            ANSWER

            Answered 2020-Jun-18 at 20:29

            One approach to indexing only news pages from a site is to rely on sitemaps, but not all sites will provide these.

Alternatively, you'd need a mechanism as part of the parsing, maybe in a ParseFilter, to determine that a page is a news item, and then filter on the presence of that key/value in the metadata during indexing.
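As a sketch of that filtering step (the metadata key and value are hypothetical and would be set by your own ParseFilter), the indexing bolt can be restricted through crawler-conf.yaml:

# only index documents whose metadata contains the pair isNews=true
indexer.md.filter: "isNews=true"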

            The way it is done in the news crawl dataset from CommonCrawl is that the seed URLs are sitemaps or RSS feeds.

            To not index the content, simply comment out

            Source https://stackoverflow.com/questions/62456731

            QUESTION

            Exception with ES query
            Asked 2020-Jun-15 at 08:57

I am using StormCrawler 1.16 with Elasticsearch 7.2.0. The Java version is 1.8.0_252, the Storm version is 1.2.3, and the Maven version is 3.6.3.

I have created the project using the Maven archetype:

            ...

            ANSWER

            Answered 2020-Jun-15 at 08:57

You shouldn't need to run the ES init script again unless you want to delete the URLs that are in the status index. If you run it more than once, there will be nothing in the status index, and this could be why the topology is idle.

There is no reason why having more URLs in the seeds file would cause a problem; we routinely use seed files with more than 1M URLs.

            Source https://stackoverflow.com/questions/62346814

Community Discussions and Code Snippets include sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install storm-crawler

            You can download it from GitHub.
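If you build a topology with Maven, the core module can also be declared as a dependency. The coordinates below are an assumption based on the com.digitalpebble.stormcrawler group used by the 2.x releases and should be verified against the version you target:

<!-- core StormCrawler spouts, bolts, protocol implementations and filters -->
<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-core</artifactId>
    <version>2.10</version>
</dependency>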

            Support

For any new features, suggestions, or bugs, create an issue on GitHub. If you have questions, check the community page and ask on Stack Overflow.

Clone

• HTTPS

  https://github.com/DigitalPebble/storm-crawler.git

• GitHub CLI

  gh repo clone DigitalPebble/storm-crawler

• SSH

  git@github.com:DigitalPebble/storm-crawler.git
