testarea-pdfbox2 | Test area for public PDFBox v2 issues | Document Editor library

by mkl-public Java Version: Current License: Apache-2.0

kandi X-RAY | testarea-pdfbox2 Summary

testarea-pdfbox2 is a Java library typically used in Editor, Document Editor, and React Native applications. It has no reported bugs or vulnerabilities, a build file is available, it has a permissive license, and it has low support. You can download it from GitHub.

Test area for public PDFBox v2 issues on stackoverflow etc

Support

testarea-pdfbox2 has a low active ecosystem.

It has 39 stars and 29 forks. There are 7 watchers for this library.

It had no major release in the last 6 months.

There are 2 open issues and 7 have been closed. On average issues are closed in 2 days. There are no pull requests.

It has a neutral sentiment in the developer community.

The latest version of testarea-pdfbox2 is current.

Quality

testarea-pdfbox2 has no bugs reported.

Security

testarea-pdfbox2 has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

License

testarea-pdfbox2 is licensed under the Apache-2.0 License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

testarea-pdfbox2 releases are not available. You will need to build from source code and install.

Build file is available. You can build the component from source.

Community Discussions

Trending Discussions on testarea-pdfbox2

Why are there invisible characters in my PDF and how do I filter them out with PDFBox?

How to dense merge PDF files using PDFBox 2 without whitespace near page breaks?

QUESTION

Why are there invisible characters in my PDF and how do I filter them out with PDFBox?

Asked 2020-Dec-11 at 10:26

I'm using PDFBox to extract text from a document by extending PDFTextStripper. I've noticed that some of these documents contain invisible characters that are being extracted. I'd like to filter out these invisible characters.

I see that there are already some stackoverflow posts on this, for example:

I tried subclassing the PDFVisibleTextStripper class found here:

https://github.com/mkl-public/testarea-pdfbox2/blob/master/src/main/java/mkl/testarea/pdfbox2/extract/PDFVisibleTextStripper.java

I used it as a drop-in replacement for PDFTextStripper. However, I found that it also filtered out text that was in fact visible.

...

ANSWER

Answered 2020-Dec-11 at 10:26

I figured out what's going on. The PDF contains a clipping rectangle that does not include 'a.'. I tried using PDFVisibleTextStripper but that stripped out text elsewhere in other documents that was in fact visible.

In the end, I wrote a class that inherits from PageDrawer and implements the showGlyph method to access the characters being drawn on the page. This method checks if the bounding box of the character is outside getGraphicsState().getCurrentClippingPath().getBounds2D().
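The clipping test described above can be sketched without any PDFBox dependency. This is only an illustration of the geometric check, not the answerer's code: `ClipVisibility` and `isVisible` are hypothetical names, and the sketch assumes both rectangles are already in the same coordinate space. Because it uses the bounds of the clipping path rather than the path itself, it is an approximation for non-rectangular clips.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper; not part of PDFBox or testarea-pdfbox2. */
final class ClipVisibility {
    private ClipVisibility() {}

    /**
     * Returns true if any part of the glyph box lies inside the clip bounds.
     * In a PageDrawer subclass, clipBounds would come from
     * getGraphicsState().getCurrentClippingPath().getBounds2D(),
     * as described in the answer; glyphBox has to be computed by the caller.
     */
    static boolean isVisible(Rectangle2D glyphBox, Rectangle2D clipBounds) {
        // intersects() is false when the interiors do not overlap,
        // which also covers empty rectangles
        return glyphBox.intersects(clipBounds);
    }
}
```

A glyph whose box is entirely outside the clip bounds would then be skipped during extraction, while a glyph that is only partly clipped is kept.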

This unfortunately means I'm not using PDFTextStripper anymore so I had to reimplement bits of its behaviour such as sorting characters by position (I was using setSortByPosition(true)). It was also a bit tricky to calculate the correct bounding box of the character based on font size and displacement.
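Sorting characters by position, once PDFTextStripper is no longer doing it, could look roughly like the following. This is a simplified sketch and not the algorithm PDFBox uses for setSortByPosition(true): `GlyphSorter`, `Glyph`, and `lineTolerance` are hypothetical names, and the sketch assumes unrotated text with y growing upward, as in default PDF user space.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper; a simplified stand-in for position-based sorting. */
final class GlyphSorter {
    /** One extracted character with the x/y of its origin. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Groups glyphs into lines top to bottom, then orders each line left to right. */
    static String toReadingOrder(List<Glyph> glyphs, double lineTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // Larger y is higher on the page, so sort descending by y
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y).reversed());

        StringBuilder out = new StringBuilder();
        List<Glyph> line = new ArrayList<>();
        for (Glyph g : sorted) {
            // Start a new line when the glyph sits clearly below the line's first glyph
            if (!line.isEmpty() && line.get(0).y - g.y > lineTolerance) {
                appendLine(out, line);
                line.clear();
            }
            line.add(g);
        }
        if (!line.isEmpty()) {
            appendLine(out, line);
        }
        return out.toString();
    }

    private static void appendLine(StringBuilder out, List<Glyph> line) {
        line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        if (out.length() > 0) {
            out.append('\n');
        }
        for (Glyph g : line) {
            out.append(g.text);
        }
    }
}
```

The tolerance keeps glyphs with slightly different baselines on the same line; a value of a few units is a guess that would need tuning per document.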

ExtractChars.java

Source https://stackoverflow.com/questions/65169010

QUESTION

How to dense merge PDF files using PDFBox 2 without whitespace near page breaks?

Asked 2020-Feb-10 at 14:37

We have been using the iText based PdfVeryDenseMergeTool we found in this SO question How To Remove Whitespace on Merge to merge multiple PDF files into a single PDF file. The tool merges PDFs without leaving any whitespace in between, and individual PDFs also get broken out across pages when possible.

We want to port PdfVeryDenseMergeTool to PDFBox. We found a PDFBox 2 based PdfDenseMergeTool that merges PDFs like this:

Individual PDFs:

Dense Merged PDF:

We are looking for something like this (this is already done in the iText-based PdfVeryDenseMergeTool, but we want to do it using PDFBox 2):

In our attempt to do the porting, we found that PdfVeryDenseMergeTool uses a PageVerticalAnalyzer that extends iText PDF Render Listener and does something every time a text, image, or arc is drawn in a PDF. And all the rendering info is then used to split an individual PDF across multiple pages. We tried looking for a similar PDF Render Listener in PDFBox 2 but found that the available PDFRenderer class only has image rendering methods. So we are not sure how to port PageVerticalAnalyzer to PDFBox.

If someone can suggest an approach to move forward, we'd greatly appreciate their help.

Thanks a lot!

EDIT 7 Feb 2020

At present, we are extending PDFGraphicsStreamEngine from PDFBox to make a custom rendering engine that tracks coordinates of images, text lines, and arcs when they are drawn. That custom engine will be the port of the PageVerticalAnalyzer. After that, we are hoping to be able to port PdfVeryDenseMergeTool to PDFBox.

EDIT 8 Feb 2020

Here is a very simple port of PageVerticalAnalyzer that handles images and text. I'm a PDFBox newbie, so my logic to handle images is probably wonky. Here's the basic approach:

Text: for every glyph printed, get the bottomY and make topY = bottomY + charHeight, mark those top/bottom points.

Image: for every call to drawImage(), it looks like there are two ways to figure out where it was drawn. First is using the coords from the last call to appendRectangle() and second is using the last calls to moveTo(), multiple lineTo(), and closePath(). I give the latter one priority. If I can't find any path (I found it in one PDF, in another, before drawImage(), I only found appendRectangle()), I use the former. If none of them exist, I have no clue what to do. Here's how I'm assuming PDFBox marks image coords using moveTo()/lineTo()/closePath():

Here is my current implementation:

...
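Independently of the implementation elided above, the "mark those top/bottom points" step for text can be sketched in plain Java. This is not the poster's code and not the iText PageVerticalAnalyzer: `VerticalMarks`, `mark`, and `markGlyph` are hypothetical names. The sketch only records which vertical ranges of a page contain content and merges overlapping ranges, so the gaps between ranges are the places where a page could be split.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that tracks vertically occupied ranges of a page. */
final class VerticalMarks {
    // Sorted, non-overlapping {bottom, top} pairs that contain content
    private final List<double[]> occupied = new ArrayList<>();

    /** Marks the range between y1 and y2 as occupied, merging with overlapping ranges. */
    void mark(double y1, double y2) {
        double bottom = Math.min(y1, y2);
        double top = Math.max(y1, y2);
        List<double[]> merged = new ArrayList<>();
        boolean placed = false;
        for (double[] range : occupied) {
            if (range[1] < bottom) {
                merged.add(range); // entirely below the new range
            } else if (range[0] > top) {
                if (!placed) {
                    merged.add(new double[] {bottom, top});
                    placed = true;
                }
                merged.add(range); // entirely above the new range
            } else {
                // Overlaps or touches: absorb into the new range
                bottom = Math.min(bottom, range[0]);
                top = Math.max(top, range[1]);
            }
        }
        if (!placed) {
            merged.add(new double[] {bottom, top});
        }
        occupied.clear();
        occupied.addAll(merged);
    }

    /** Text case from the question: topY = bottomY + charHeight. */
    void markGlyph(double bottomY, double charHeight) {
        mark(bottomY, bottomY + charHeight);
    }

    /** Returns a copy of the occupied ranges, ordered from bottom to top. */
    List<double[]> occupied() {
        return new ArrayList<>(occupied);
    }
}
```

An image could be recorded the same way by calling `mark` with the lowest and highest y of its drawn area, however those coordinates are obtained.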

ANSWER

Answered 2020-Feb-10 at 14:37

This answer suffers from the same issues as the original iText version does.

A port of the PageVerticalAnalyzer

One can port the PageVerticalAnalyzer as follows from iText to PDFBox:

Source https://stackoverflow.com/questions/60052967

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install testarea-pdfbox2

You can download it from GitHub.
You can use testarea-pdfbox2 like any standard Java library. Include the jar files in your classpath. You can also run and debug the testarea-pdfbox2 component in any IDE, as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, refer to maven.apache.org. For Gradle installation, refer to gradle.org.

Support

For new features, suggestions, and bugs, create an issue on GitHub. If you have questions, search for existing answers or ask on Stack Overflow.