testarea-pdfbox2 | Test area for public PDFBox v2 issues | Document Editor library
kandi X-RAY | testarea-pdfbox2 Summary
Test area for public PDFBox v2 issues on stackoverflow etc
Community Discussions
Trending Discussions on testarea-pdfbox2
QUESTION
I'm using PDFBox to extract text from a document by extending PDFTextStripper. I've noticed that some of these documents contain invisible characters that are being extracted. I'd like to filter out these invisible characters.
I see that there are already some stackoverflow posts on this, for example:
- PDFBox - Removing invisible text (by clip/filling paths issue)
- remove invisible text from pdf using pdfbox
I tried subclassing the PDFVisibleTextStripper class and used it as a drop-in replacement for PDFTextStripper. However, I found that this filtered out text that was in fact visible.
ANSWER
Answered 2020-Dec-11 at 10:26
I figured out what's going on. The PDF contains a clipping rectangle that does not include 'a.'. I tried using PDFVisibleTextStripper, but that stripped out text elsewhere in other documents that was in fact visible.
In the end, I wrote a class that inherits from PageDrawer and implements the showGlyph method to access the characters being drawn on the page. This method checks if the bounding box of the character is outside getGraphicsState().getCurrentClippingPath().getBounds2D().
This unfortunately means I'm not using PDFTextStripper anymore, so I had to reimplement bits of its behaviour such as sorting characters by position (I was using setSortByPosition(true)). It was also a bit tricky to calculate the correct bounding box of the character based on font size and displacement.
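The check described above can be illustrated without PDFBox at all. The following is only a sketch, not the poster's ExtractChars.java: the class ClipFilterSketch and its Glyph type are hypothetical names, and it assumes that each glyph's bounding box and the clip bounds in force when it was drawn have already been captured (for example inside a showGlyph override). It keeps glyphs whose box overlaps the clip bounds and then orders them top to bottom, left to right, which loosely mimics what sorting by position would do.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not a PDFBox class): filters glyphs by clip bounds. */
final class ClipFilterSketch {

    /** Illustrative holder: a character, its box, and the clip bounds at draw time. */
    static final class Glyph {
        final String text;
        final Rectangle2D box;
        final Rectangle2D clipBounds;

        Glyph(String text, Rectangle2D box, Rectangle2D clipBounds) {
            this.text = text;
            this.box = box;
            this.clipBounds = clipBounds;
        }
    }

    /** True if the glyph box overlaps the clip bounds; boxes with zero area never do. */
    static boolean isVisible(Glyph g) {
        return g.clipBounds.intersects(g.box);
    }

    /** Drops clipped glyphs, then orders the rest by descending y and ascending x. */
    static String visibleText(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (isVisible(g)) {
                kept.add(g);
            }
        }
        // Assumes PDF user space, where y grows upward, so larger y comes first.
        kept.sort(Comparator.comparingDouble((Glyph g) -> g.box.getY()).reversed()
                .thenComparingDouble(g -> g.box.getX()));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : kept) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Rectangle2D clip = new Rectangle2D.Double(0, 0, 200, 200);
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("b", new Rectangle2D.Double(20, 100, 5, 8), clip),
                new Glyph("x", new Rectangle2D.Double(500, 500, 5, 8), clip), // outside the clip
                new Glyph("a", new Rectangle2D.Double(10, 100, 5, 8), clip));
        System.out.println(visibleText(glyphs));
    }
}
```

Testing against the bounds of the clipping path, as the answer does, is an approximation: a glyph inside the bounds of a non-rectangular clip may still be hidden.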
ExtractChars.java
QUESTION
We have been using the iText based PdfVeryDenseMergeTool we found in this SO question How To Remove Whitespace on Merge to merge multiple PDF files into a single PDF file. The tool merges PDFs without leaving any whitespace in between, and individual PDFs also get broken out across pages when possible.
We want to port PdfVeryDenseMergeTool to PDFBox. We found a PDFBox 2 based PdfDenseMergeTool that merges PDFs like this:
Individual PDFs:
Dense Merged PDF:
We are looking for something like this (this is already done in the iText-based PdfVeryDenseMergeTool, but we want to do it using PDFBox 2):
In our attempt to do the porting, we found that PdfVeryDenseMergeTool uses a PageVerticalAnalyzer that extends iText PDF Render Listener and does something every time a text, image, or arc is drawn in a PDF. And all the rendering info is then used to split an individual PDF across multiple pages. We tried looking for a similar PDF Render Listener in PDFBox 2 but found that the available PDFRenderer class only has image rendering methods. So we are not sure how to port PageVerticalAnalyzer to PDFBox.
If someone can suggest an approach to move forward, we'd greatly appreciate their help.
Thanks a lot!
EDIT 7 Feb 2020
At present, we are extending PDFGraphicsStreamEngine from PDFBox to make a custom rendering engine that tracks coordinates of images, text lines, and arcs when they are drawn. That custom engine will be the port of the PageVerticalAnalyzer. After that, we are hoping to be able to port PdfVeryDenseMergeTool to PDFBox.
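Whatever engine ends up reporting the drawing events, the analyzer side needs some bookkeeping of which vertical ranges of a page contain content. The sketch below shows one way to do that in plain Java. It is an assumption for illustration, not the code of PageVerticalAnalyzer: the class VerticalUseSketch and its methods are hypothetical names, and it assumes each drawn item (glyph, image, path) reports a bottom and a top y coordinate. Overlapping or touching ranges are merged, so the gaps between the remaining ranges are the places where a page could be split.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tracker (not a PDFBox or iText class) of vertically used page sections. */
final class VerticalUseSketch {

    /** Sorted, disjoint {bottom, top} pairs that contain drawn content. */
    private final List<double[]> used = new ArrayList<>();

    /** Records that something was drawn between y1 and y2 (in either order). */
    void markUsed(double y1, double y2) {
        double lo = Math.min(y1, y2);
        double hi = Math.max(y1, y2);
        List<double[]> merged = new ArrayList<>();
        boolean placed = false;
        for (double[] s : used) {
            if (s[1] < lo) {
                merged.add(s);                       // entirely below the new range
            } else if (s[0] > hi) {
                if (!placed) {                       // first section above: insert new range
                    merged.add(new double[] {lo, hi});
                    placed = true;
                }
                merged.add(s);
            } else {
                lo = Math.min(lo, s[0]);             // overlapping or touching: absorb it
                hi = Math.max(hi, s[1]);
            }
        }
        if (!placed) {
            merged.add(new double[] {lo, hi});
        }
        used.clear();
        used.addAll(merged);
    }

    /** Convenience for text: a glyph occupies bottomY up to bottomY + charHeight. */
    void markGlyph(double bottomY, double charHeight) {
        markUsed(bottomY, bottomY + charHeight);
    }

    /** Returns copies of the used sections, ordered from the lowest to the highest. */
    List<double[]> usedSections() {
        List<double[]> copy = new ArrayList<>();
        for (double[] s : used) {
            copy.add(s.clone());
        }
        return copy;
    }

    public static void main(String[] args) {
        VerticalUseSketch page = new VerticalUseSketch();
        page.markGlyph(700, 12);   // one glyph on a text line
        page.markGlyph(705, 12);   // overlapping glyph, merged with the previous one
        page.markUsed(100, 300);   // e.g. the vertical extent of an image
        for (double[] s : page.usedSections()) {
            System.out.println(s[0] + " .. " + s[1]);
        }
    }
}
```

In a real port, the callbacks of the custom engine would call markUsed or markGlyph with coordinates already transformed into page space.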
EDIT 8 Feb 2020
Here is a very simple port of PageVerticalAnalyzer that handles images and text. I'm a PDFBox newbie, so my logic to handle images is probably wonky. Here's the basic approach:
Text: for every glyph printed, get the bottomY and make topY = bottomY + charHeight, mark those top/bottom points.
Image: for every call to drawImage(), it looks like there are two ways to figure out where it was drawn. First is using the coords from the last call to appendRectangle() and second is using the last calls to moveTo(), multiple lineTo(), and closePath(). I give the latter one priority. If I can't find any path (I found it in one PDF, in another, before drawImage(), I only found appendRectangle()), I use the former. If none of them exist, I have no clue what to do. Here's how I'm assuming PDFBox marks image coords using moveTo()/lineTo()/closePath():
Here is my current implementation:
ANSWER
Answered 2020-Feb-10 at 14:37
This answer suffers from the same issues as the original iText version does.
A port of the PageVerticalAnalyzer
One can port the PageVerticalAnalyzer from iText to PDFBox as follows:
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install testarea-pdfbox2
You can use testarea-pdfbox2 like any standard Java library. Include the jar files in your classpath. You can also use any IDE to run and debug the testarea-pdfbox2 component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.