webmagic | A scalable web crawler framework for Java. | Crawler library

by code4craft | Java | Version: WebMagic-0.7.3 | License: Apache-2.0

kandi X-RAY | webmagic Summary

webmagic is a Java library typically used in automation, crawler, and framework applications. It has no reported bugs or vulnerabilities, ships with a build file, carries a permissive license, and has medium support. You can download it from GitHub or Maven.
A scalable crawler framework. It covers the whole crawler lifecycle: downloading, URL management, content extraction, and persistence. It can simplify the development of a specific crawler.
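
The four lifecycle stages named above can be illustrated with a small, self-contained sketch. This is not webmagic's API: the class `MiniCrawler`, the in-memory "web" map, and the two regular expressions are hypothetical stand-ins chosen so the example runs without network access or extra dependencies.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy crawler; it only mirrors the four stages, not webmagic's classes.
class MiniCrawler {
    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    private final Map<String, String> web;                     // stand-in for HTTP: URL -> body
    private final Deque<String> queue = new ArrayDeque<>();    // URLs waiting to be fetched
    private final Set<String> seen = new HashSet<>();          // URLs already scheduled
    private final List<String> stored = new ArrayList<>();     // stand-in for a database or file

    MiniCrawler(Map<String, String> web) {
        this.web = web;
    }

    // URL management: schedule a URL only the first time it is seen.
    private void enqueue(String url) {
        if (seen.add(url)) {
            queue.add(url);
        }
    }

    List<String> crawl(String startUrl) {
        enqueue(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String body = web.get(url);                        // 1. downloading
            if (body == null) {
                continue;                                      // unreachable page: nothing to do
            }
            Matcher links = LINK.matcher(body);
            while (links.find()) {
                enqueue(links.group(1));                       // 2. URL management
            }
            Matcher title = TITLE.matcher(body);
            if (title.find()) {                                // 3. content extraction
                stored.add(url + " -> " + title.group(1));     // 4. persistence
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        Map<String, String> web = new HashMap<>();
        web.put("a", "<title>A</title><a href=\"b\">b</a><a href=\"a\">self</a>");
        web.put("b", "<title>B</title><a href=\"a\">back</a><a href=\"missing\">x</a>");
        System.out.println(new MiniCrawler(web).crawl("a"));   // prints [a -> A, b -> B]
    }
}
```

The `seen` set is what keeps the loop from revisiting pages that link back to each other; a real framework would typically add politeness delays, retries, and concurrent workers on top of the same flow.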

Support

  • webmagic has a medium active ecosystem.
  • It has 10324 star(s) with 4072 fork(s). There are 799 watchers for this library.
  • It had no major release in the last 12 months.
  • There are 291 open issues and 582 have been closed. On average issues are closed in 174 days. There are 33 open pull requests and 0 closed requests.
  • It has a neutral sentiment in the developer community.
  • The latest version of webmagic is WebMagic-0.7.3

Quality

  • webmagic has 0 bugs and 0 code smells.

Security

  • webmagic has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
  • webmagic code analysis shows 0 unresolved vulnerabilities.
  • There are 0 security hotspots that need review.

License

  • webmagic is licensed under the Apache-2.0 License. This license is Permissive.
  • Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

  • webmagic releases are available to install and integrate.
  • Deployable package is available in Maven.
  • Build file is available. You can build the component from source.
  • Installation instructions are not available. Examples and code snippets are available.
  • webmagic saves you 8030 person hours of effort in developing the same functionality from scratch.
  • It has 16523 lines of code, 1075 functions and 268 files.
  • It has medium code complexity. Code complexity directly impacts maintainability of the code.
Top functions reviewed by kandi - BETA

kandi has reviewed webmagic and identified the functions below as its top functions. This is intended to give you instant insight into the functionality webmagic implements and help you decide whether it suits your requirements.

  • Process single field.
  • Loads the configuration.
  • Handle object map.
  • Start the spider.
  • Evaluates the script.
  • Detect charset from content type.
  • Generate http client.
  • Enqueue a runnable.
  • Convert the request to a HttpUriRequest object.
  • Read options.

webmagic Key Features

A scalable web crawler framework for Java.

Install:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.5</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.5</version>
</dependency>

First crawler:

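import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    // Crawl settings: retry failed downloads 3 times and wait 1000 ms between requests.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue every link on the page that looks like a repository URL.
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        // Extract the fields of interest from the current page.
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // No repository name found: skip this page so it is not persisted.
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        // Start from the author's profile page and crawl with 5 threads.
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}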
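
To see what the two regular expressions in the crawler above select, the following sketch applies the same patterns with plain `java.util.regex`. It assumes the framework's regex selector yields the first capture group; the class name `RegexDemo`, its helper methods, and the sample HTML string are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; it reuses the patterns from the crawler but not webmagic itself.
class RegexDemo {
    // Matches repository-style URLs of the form https://github.com/<owner>/<repo>.
    static final Pattern REPO_LINK = Pattern.compile("(https://github\\.com/\\w+/\\w+)");
    // Captures the owner segment; requires a further path segment after it.
    static final Pattern AUTHOR = Pattern.compile("https://github\\.com/(\\w+)/.*");

    // Returns every repository-style URL found in the text, in order of appearance.
    static List<String> repoLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = REPO_LINK.matcher(text);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Returns the owner segment of the URL, or null when the pattern does not match.
    static String author(String url) {
        Matcher m = AUTHOR.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://github.com/code4craft/webmagic\">repo</a>"
                + "<a href=\"https://example.com/a/b\">other</a>";
        System.out.println(repoLinks(html));                                   // [https://github.com/code4craft/webmagic]
        System.out.println(author("https://github.com/code4craft/webmagic"));  // code4craft
        System.out.println(author("https://github.com/code4craft"));           // null
    }
}
```

Because `\w` does not match a hyphen, owner or repository names containing `-` are cut off at the hyphen by these patterns; a character class such as `[\w-]+` is one possible adjustment if that matters for your targets.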

how to find empty use mongodb?

db.xx.find({"fields.name.sourceTexts":[]})
-----------------------
db.xx.find( { "fields.name.sourceTexts" : { $exists:true, $size:0 } } )

can't inject repository when use @Autowired in a non-web-application

import org.springframework.beans.BeansException;
import org.springframework.context.ApplicationContext;
import org.springframework.context.ApplicationContextAware;

// This class must itself be registered as a Spring bean; otherwise
// setApplicationContext is never called and the static context stays null.
public class ApplicationContextProvider implements ApplicationContextAware {

    private static ApplicationContext context;

    public static ApplicationContext getApplicationContext() {
        return context;
    }

    @Override
    public void setApplicationContext(ApplicationContext applicationContext) throws BeansException {
        context = applicationContext;
    }
}

// Look the repositories up from the context instead of relying on @Autowired.
ArticleRepository articleRepository = ApplicationContextProvider.getApplicationContext().getBean(ArticleRepository.class);
CategoryRepository categoryRepository = ApplicationContextProvider.getApplicationContext().getBean(CategoryRepository.class);
NewsRepository newsRepository = ApplicationContextProvider.getApplicationContext().getBean(NewsRepository.class);
SourceRepository sourceRepository = ApplicationContextProvider.getApplicationContext().getBean(SourceRepository.class);
-----------------------
Spider.create(this).addUrl("http://3g.163.com/touch/reconstruct/article/list/BAI6RHDKwangning/0-1.html")
    .thread(5)
    .run();

Community Discussions

Trending Discussions on webmagic
  • how to find empty use mongodb?
  • can't inject repository when use @Autowired in a non-web-application

QUESTION

how to find empty use mongodb?

Asked 2018-Jul-10 at 07:06

I want to query the database with:

db.xx.find({"fields.name.sourceTexts":null})

or

db.xx.find({"fields.name.sourceTexts":""})

but neither query works as expected; both return all documents:

[
    {
        "_id": "5b432195e28b99127c59161e",
        "fields": {
            "img": {
                "sourceTexts": [],
                "_class": "us.codecraft.webmagic.selector.PlainText"
            },
            "name": {
                "sourceTexts": [],
                "_class": "us.codecraft.webmagic.selector.PlainText"
            },
            "old": {
                "sourceTexts": [],
                "_class": "us.codecraft.webmagic.selector.PlainText"
            },
            "post": {
                "sourceTexts": [],
                "_class": "us.codecraft.webmagic.selector.PlainText"
            },
            "focusTieba": [],
            "visitor": {
                "sum": 0,
                "list": []
            },
            "follow": {
                "sum": 0,
                "list": []
            },
            "fans": {
                "sum": 0,
                "list": []
            }
        },
        "request": {
            "url": "http://tieba.baidu.com/home/main?un=%DB%A2%D4%B495",
            "cookies": {},
            "headers": {},
            "priority": 0,
            "binaryContent": false
        },
        "skip": false,
        "_class": "us.codecraft.webmagic.ResultItems"
    }
]

ANSWER

Answered 2018-Jul-10 at 06:40

If I understand what you need, I think you want the following query:

db.xx.find({"fields.name.sourceTexts":[]})

Source https://stackoverflow.com/questions/51258616

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install webmagic

You can download it from GitHub, Maven.
You can use webmagic like any standard Java library. Include the jar files in your classpath. You can also use any IDE to run and debug the webmagic component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, refer to maven.apache.org. For Gradle installation, refer to gradle.org.

Support

For new features, suggestions, and bugs, create an issue on GitHub. If you have questions, search for existing answers or ask on the Stack Overflow community page.


  • © 2022 Open Weaver Inc.