pdftools | Text Extraction, Rendering and Conversion of PDF Documents | Document Editor library
kandi X-RAY | pdftools Summary
Scientific articles are typically locked away in PDF format, a format designed primarily for printing but not so great for searching or indexing. The new pdftools package allows for extracting text and metadata from PDF files in R. From the extracted plain text one could find articles discussing a particular drug or species name, without having to rely on publishers providing metadata or on pay-walled search engines. pdftools slightly overlaps with the Rpoppler package by Kurt Hornik. The main motivation behind developing pdftools was that Rpoppler depends on glib, which does not work well on Mac and Windows. The pdftools package uses the Poppler C++ interface together with Rcpp, which results in a lighter and more portable implementation.
Community Discussions
Trending Discussions on pdftools
QUESTION
My eventual goal is to transform thousands of PDFs into a corpus / document-term matrix to conduct some topic modeling. I am using the pdftools package to import my PDFs and the tm package to prepare my data for text mining. I managed to import and transform one individual PDF, like this:
...
ANSWER
Answered 2021-Jun-05 at 05:52
You can write a function which has a series of steps that you want to execute on each PDF, and then apply it to every file.
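The answer's code isn't reproduced here; the following is a minimal sketch of that idea. The helper name prepare_pdf, the directory path, and the specific tm cleaning steps are placeholders to adapt to your own pipeline.

library(pdftools)
library(tm)

# Wrap the per-PDF steps in one function (hypothetical helper name)
prepare_pdf <- function(path) {
  text   <- pdf_text(path)                        # one string per page
  corpus <- Corpus(VectorSource(text))            # each page becomes a document
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  corpus
}

# Apply it to every PDF in a folder ("pdfs" is a placeholder path)
pdf_files <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
corpora   <- lapply(pdf_files, prepare_pdf)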
QUESTION
I am trying to scrape this PDF containing information about company subsidiaries. I have seen many posts using the R package Tabulizer, but this, unfortunately, doesn't work on my Mac for some reason. As Tabulizer uses Java dependencies, I tried installing different versions of Java (6-13) and then reinstalling the packages, but still no luck in getting this to work (what happens is that when I run extract_tables the R session aborts).
I need to scrape the whole PDF from page 19 onwards and construct a table showing company names and their subsidiaries. In the PDF, names start with any letter/number/symbol, whereas subsidiaries start with either a single or a double dot.
So I tried with the pdftools and pdftables packages. The code below provides a table similar to the one on page 19:
ANSWER
Answered 2021-May-26 at 06:59
Like @Justin Coco hinted, this was a lot of fun. The code ended up a bit more complex than I anticipated, but I think the result should be what you imagined.
I used pdf_data instead of pdf_text so I can work with the position of words.
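For reference, a small sketch of what pdf_data() returns and how the dot-prefix rule from the question could be applied. The file name and page index are placeholders.

library(pdftools)

# pdf_data() returns one data frame per page, with the x/y position of every word
words <- pdf_data("subsidiaries.pdf")[[19]]   # page 19, per the question

# columns: width, height, x, y, space, text
head(words)

# in this document's layout, subsidiary names start with one or two dots
subsidiaries <- words[startsWith(words$text, "."), ]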
QUESTION
I was wondering how I could locate the 2nd and 3rd columns from the left in the table on the last page (page 18) of this PDF document.
I'm using the pdftools package; is there a way to extract the 2nd and 3rd columns from the left, which are just numeric data?
ANSWER
Answered 2020-Dec-31 at 07:03
Making use of some tidyverse packages, this could be achieved like so:
- Filter for the values in the 2nd and 3rd columns. The 2nd column values start at position x=189, the 3rd column at x=252.
- Additionally, to make sure that we only get the values, I first convert to numeric, whereby all text gets converted to NA. Note: one of the values has a comma as a decimal mark, which I first had to remove.
- After getting the values I reshape the dataset using pivot_wider, for which I add a rowid.
- Finally I rename the columns. A sketch of these steps is shown below.
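A rough sketch of those steps, assuming the word positions come from pdf_data(); the file name and the final column names are placeholders.

library(pdftools)
library(dplyr)
library(tidyr)

words <- pdf_data("report.pdf")[[18]]                  # last page (page 18)

result <- words %>%
  filter(x %in% c(189, 252)) %>%                       # 2nd column at x = 189, 3rd at x = 252
  mutate(value = as.numeric(gsub(",", "", text))) %>%  # strip the stray comma, then convert
  filter(!is.na(value)) %>%                            # keep only the numeric values
  group_by(x) %>%
  mutate(rowid = row_number()) %>%                     # row index within each column
  ungroup() %>%
  pivot_wider(id_cols = rowid, names_from = x, values_from = value) %>%
  rename(col2 = `189`, col3 = `252`)                   # placeholder column names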
QUESTION
Is it possible to read/convert PDF portfolios in R?
I usually use pdftools; however, I get an error:
ANSWER
Answered 2021-May-06 at 03:27
There seems to be an issue with pdf_convert handling one-page raw PDF data (it wants to use basename(pdf) under these conditions), so I have edited that function so that it also works with the second attached PDF file.
If you only need the first file then you could run this with the original pdf_convert function, but it will give an error with the second file.
If you are interested in rendering raster graphics from the attached files, this worked for me:
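The edited function itself isn't reproduced here. As a minimal sketch of the rendering part, one way to side-step the raw-data issue is to write each attachment to a temporary file first; the portfolio file name is a placeholder, and the attachment structure is assumed to follow pdftools::pdf_attachments().

library(pdftools)

attachments <- pdf_attachments("portfolio.pdf")   # embedded files in the portfolio

for (att in attachments) {
  tmp <- tempfile(fileext = ".pdf")
  writeBin(att$data, tmp)                         # save the attached PDF to disk
  pdf_convert(tmp, format = "png", dpi = 150)     # render each page as a PNG
}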
QUESTION
I'm working with C# and the Adobe Acrobat SDK. When the program throws an error because the PDF is already compressed, I want to move the PDF.
However, C# complains that the file is being used by another process, and I know it has to do with the SDK and not another program.
After some debugging I found out that compressPDFOperation.Execute is the culprit.
How can I close it so that I can move the file?
...
ANSWER
Answered 2021-Apr-29 at 11:28
I've found a solution; it's not best practice, but I don't know another way to do it.
I declared all the variables used to execute the compression (sourceFileRef, compressPdfOperation, ...) before the try/catch statement, and after result.SaveAs(...) I set those variables to null and run the garbage collection.
QUESTION
I am using the R programming language. I learned how to take PDF files from the internet and load them into R. For example, below I load 3 different books by Shakespeare into R:
...
ANSWER
Answered 2021-Apr-09 at 06:39
As the error message suggests, VectorSource only takes 1 argument. You can rbind the datasets together and pass the result to the VectorSource function.
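A minimal sketch of that approach; the file names are placeholders for the three Shakespeare PDFs, and keeping each book in a one-column data frame is just one way to make rbind applicable.

library(pdftools)
library(tm)

# collapse each book into a single string and keep it in a one-column data frame
read_book <- function(path) data.frame(text = paste(pdf_text(path), collapse = " "))

books <- rbind(read_book("book1.pdf"),
               read_book("book2.pdf"),
               read_book("book3.pdf"))

corpus <- Corpus(VectorSource(books$text))   # VectorSource gets a single character vector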
QUESTION
I'm using R to pull data from PDFs and so far it has been going well. I just opened up a new batch of PDFs and saw that I have to figure out how to account for empty cells. I haven't found a way to do this, and I have hundreds of pages that I need to go through.
I've included some sample data. I haven't found a way to attach the PDFs here, and these are not posted on the web anywhere. I saved df as a CSV, then copied and pasted that into a Word document which I saved as a CSV for this example. Screenshot attached as well.
ANSWER
Answered 2021-Apr-01 at 02:09
This looks like a good scenario to use the tabulizer package. It works really well when there are nicely formatted tables like this in the PDF. See the vignette. The best function here for you would be tabulizer::extract_tables. It should also recognize the blank spaces as empty values, assuming the PDFs are all well formatted like this.
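A minimal example of that call; the file name and page number are placeholders, and note that tabulizer still needs a working Java installation.

library(tabulizer)

tables <- extract_tables("report.pdf", pages = 1, output = "data.frame")
df <- tables[[1]]   # first detected table; blank cells come through as empty values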
QUESTION
I've been web scraping articles in R from the Oxford journals and want to grab the full text of specific articles. All articles have a PDF link, so I've been trying to pull the PDF link and scrape the entire text into a CSV. The full text should all fit into one row; however, the output in the CSV file shows one article spread across 11 rows. How can I fix this issue?
The code is below:
...
ANSWER
Answered 2021-Mar-31 at 03:30
Looking at your last function, if I understand correctly, you want to take the URL, scrape all the text into a data frame/tibble, and then export it to a CSV. Here is how you can do it with just 1 article, and you should be able to loop through some links with a little manipulation (apologies if I am misunderstanding):
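The original snippet isn't shown here; the following is a sketch of that single-article case. The URL is a placeholder for the article's PDF link.

library(pdftools)
library(tibble)
library(readr)

pdf_url <- "https://example.org/article.pdf"            # placeholder link

full_text <- paste(pdf_text(pdf_url), collapse = " ")   # all pages collapsed into one string

article <- tibble(url = pdf_url, text = full_text)      # one article = one row
write_csv(article, "articles.csv")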
QUESTION
I have a PDF with 10 pages and I want to read all of the data. I worked on the code below, but it gives me only the first page of data.
...
ANSWER
Answered 2021-Feb-22 at 09:10
I just tested your code and it worked fine for me. Here is a simpler way that you may like more.
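The answer's code isn't reproduced here. For reference, pdf_text() already returns every page, so a sketch of the simpler route looks like this (the file name is a placeholder):

library(pdftools)

pages <- pdf_text("document.pdf")   # one element per page
length(pages)                       # 10 for a 10-page PDF

all_text <- paste(pages, collapse = "\n")   # the whole document as one string
cat(pages[2])                               # or inspect a single page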
QUESTION
The code below helps me pull the first page of every PDF file from the directory.
...
ANSWER
Answered 2021-Feb-03 at 01:39
If you want to search for the lines in the first page, you can try with lapply as follows:
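The original code isn't shown here; this is a minimal sketch of the lapply idea. The directory path and the example keyword are placeholders.

library(pdftools)

files <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)

first_page_lines <- lapply(files, function(f) {
  strsplit(pdf_text(f)[1], "\n")[[1]]   # first page only, split into lines
})
names(first_page_lines) <- basename(files)

# e.g. search those lines for a keyword
lapply(first_page_lines, function(lines) grep("Total", lines, value = TRUE))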
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install pdftools
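The package is available from CRAN; on Linux the Poppler C++ development headers (for example libpoppler-cpp-dev on Debian/Ubuntu) are needed first.

install.packages("pdftools")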
The ?pdftools manual page shows a brief overview of the main utilities. The most important function is pdf_text, which returns a character vector of length equal to the number of pages in the PDF. Each string in the vector contains a plain text version of the text on that page.
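A quick illustration of that (the file name is a placeholder):

library(pdftools)

txt <- pdf_text("example.pdf")

length(txt)   # number of pages in the PDF
cat(txt[1])   # plain text of the first page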