pubMunch | various tools to download, convert and process the full text

by maximilianh | Python | Version: Current | License: No License

kandi X-RAY | pubMunch Summary

pubMunch is a Python library. It has no reported bugs or vulnerabilities, a build file is available, and it has low support. You can download it from GitHub.

NOTE: There is now a Python 3 version of this repo, and ongoing development work is happening there. These are the tools that I wrote for the UCSC Genocoding project. They allow you to download full-text research articles from the internet, convert them to text and run text mining algorithms on them. All tools start with the prefix "pub".

Support

pubMunch has a low-activity ecosystem.

It has 48 star(s) with 21 fork(s). There are 8 watchers for this library.

It had no major release in the last 6 months.

There is 1 open issue and 5 have been closed. On average, issues are closed in 315 days. There is 1 open pull request and 0 closed pull requests.

It has a neutral sentiment in the developer community.

The latest version of pubMunch is listed as "Current".

Quality

pubMunch has no bugs reported.

Security

pubMunch has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

License

pubMunch does not have a standard license declared.

Check the repository for any license declaration and review the terms closely.

Without a license, all rights are reserved, and you cannot use the library in your applications.

Reuse

pubMunch releases are not available. You will need to build from source code and install.

A build file is available. You can build the component from source.

Installation instructions, examples and code snippets are available.

Top functions reviewed by kandi - BETA

kandi has reviewed pubMunch and discovered the below as its top functions. This is intended to give you an instant insight into pubMunch implemented functionality, and help decide if they suit your requirements.

Decorator to log the phase of each token.
Create a configuration dictionary for HighWire publishers.
Get stylesheet.
Return an Element Builder.
Create a DOM builder.
Compile regex patterns.
Parse Elsevier metadata.
Return a dictionary of publication counts.
Crawl files via PubMed.
Parse the NLM XML file.

pubMunch Key Features

No Key Features are available at this moment for pubMunch.

pubMunch Examples and Code Snippets

No Code Snippets are available at this moment for pubMunch.

Community Discussions

No Community Discussions are available at this moment for pubMunch. Refer to the Stack Overflow page for discussions.

Vulnerabilities

No vulnerabilities reported

Install pubMunch

Install these packages on Ubuntu: sudo apt-get install catdoc poppler-utils docx2text gnumeric python-lxml
catdoc contains various converters for Microsoft Office files
poppler-utils contains one of the pdftotext converters
docx2text is a Perl script for docx files
gnumeric includes the ssconvert tool for xlsx Excel files
python-lxml is a fast XML/HTML parser
html2text is required; it is used for the HTML -> text conversion (written by Aaron Swartz)
requests is very useful for pubCrawl2 and highly recommended
selenium is optional; it is only used to crawl Karger journals
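Before running any of the tools, it can help to check which of the dependencies listed above are actually present. The following is a minimal sketch, not part of pubMunch: the function names are hypothetical, and the executable names (catdoc, pdftotext, ssconvert) are assumed from the descriptions above. docx2text is left out because the name of the script it installs is not stated here. Treating lxml as required is also an assumption, based on it appearing in the install command.

```python
import importlib.util
import shutil

# Command-line converters named in the list above, as package -> executable.
# The executable names are assumptions: a package name does not always
# match the name of the binary it installs.
CONVERTER_COMMANDS = {
    "catdoc": "catdoc",            # Microsoft Office files
    "poppler-utils": "pdftotext",  # PDF files
    "gnumeric": "ssconvert",       # Excel files
}

# Python modules named in the list above, as module -> required?
PYTHON_MODULES = {
    "lxml": True,        # assumed required (part of the install command)
    "html2text": True,   # stated as required
    "requests": False,   # highly recommended
    "selenium": False,   # only used for Karger journals
}


def missing_commands(commands=CONVERTER_COMMANDS):
    """Return the package names whose executable is not found on PATH."""
    return sorted(
        pkg for pkg, exe in commands.items() if shutil.which(exe) is None
    )


def missing_modules(modules=PYTHON_MODULES):
    """Return (missing required, missing optional) module names."""
    required, optional = [], []
    for name, is_required in modules.items():
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            (required if is_required else optional).append(name)
    return sorted(required), sorted(optional)


if __name__ == "__main__":
    print("missing converter packages:", missing_commands())
    required, optional = missing_modules()
    print("missing required modules:", required)
    print("missing optional modules:", optional)
```

Both functions only look things up and never install anything, so the script is safe to run on any machine.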

Support

fixme: illegal DOI landing page: http://www.nature.com/doifinder/10.1046/j.1523-1747.1998.00092.x
URL constructor: http://www.nature.com/nature/journal/v437/n7062/full/4371102a.html for DOI doi:10.1038/4371102a
URL construction for supplemental files: http://www.nature.com/bjc/journal/v103/n10/suppinfo/6605908s1.html
no access page: http://www.nature.com/nrclinonc/journal/v7/n11/full/nrclinonc.2010.119.html
cat /cluster/home/max/projects/pubs/crawlDir/rupress/articleMeta.tab | head -n13658 | tail -n2 > problem.txt
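The "URL constructor" note above pairs one DOI with one full-text URL. As a rough illustration of the pattern that single example suggests, the sketch below rebuilds the URL from a DOI plus journal, volume and issue. It is not pubMunch's crawler code: the function name is hypothetical, the URL template is inferred from that one example, and rejecting every DOI prefix other than 10.1038 is an assumption made only so that DOIs like the 10.1046 one in the first note are refused.

```python
def nature_article_url(doi, journal, volume, issue):
    """Build a full-text URL in the shape shown in the notes above.

    Hypothetical helper: the template is inferred from a single example.
    """
    # Strip an optional "doi:" prefix, then split at the first slash.
    bare = doi[4:] if doi.lower().startswith("doi:") else doi
    prefix, _, suffix = bare.partition("/")
    # Assumption: only DOIs with the 10.1038 prefix fit this template.
    if prefix != "10.1038" or not suffix:
        raise ValueError("not a 10.1038 DOI: %r" % doi)
    return "http://www.nature.com/%s/journal/v%s/n%s/full/%s.html" % (
        journal, volume, issue, suffix,
    )


if __name__ == "__main__":
    print(nature_article_url("doi:10.1038/4371102a", "nature", 437, 7062))
```

The journal, volume and issue have to come from article metadata, since the DOI in the example does not carry them in a form that can be read off reliably.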
