pdf-text-extract | Extract text from pdfs that contain searchable pdf text | Document Editor library
kandi X-RAY | pdf-text-extract Summary
Extract text from PDFs that contain searchable PDF text. The module is a wrapper that calls the pdftotext command to perform the actual extraction.
Top functions reviewed by kandi - BETA
- Implements the contents of PDFExtract.
pdf-text-extract Key Features
pdf-text-extract Examples and Code Snippets
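As an illustration of what the wrapper does under the hood (it shells out to pdftotext, per the summary above), here is a minimal Python sketch - not the library's own JavaScript API - assuming pdftotext from poppler-utils is installed and sample.pdf is a placeholder file:

    import subprocess

    def extract_text(pdf_path):
        """Run pdftotext on pdf_path and return a list of per-page strings."""
        txt_path = pdf_path + ".txt"
        subprocess.run(["pdftotext", pdf_path, txt_path], check=True)
        with open(txt_path, encoding="utf-8") as f:
            # pdftotext separates pages with form-feed characters
            return f.read().split("\f")

    for page_number, page_text in enumerate(extract_text("sample.pdf"), start=1):
        print(page_number, page_text[:80])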
Community Discussions
Trending Discussions on pdf-text-extract
QUESTION
Since the background would be far too complicated to explain, I am writing pseudocode; I am only interested in the Python regex pattern. I hope one of you can help me.
I have the following input text (lots of lines with \n as line separator, condensed to '.'):
ANSWER
Answered 2021-Mar-07 at 13:35 The solution is deceptively simple - use the non-greedy operator ?.
To begin with, the character class regex [] matches ANY character in it, so to match a and b the regex is [ab] and not [a|b]. So the content part of your code should be [.\s\S].
Also, \s and \S match all spaces and non-spaces respectively, so the period (.) is irrelevant here. So the final content part should look like this: [\s\S]*
The non-greedy ? operator, placed after a normal quantifier like +, * or ?, tells the regex to match as few repetitions of the element as possible. With *, you're using the default greedy version of zero-or-more, telling the regex to match as many as possible (which ends up matching even the first Truck you want!).
So we add a non-greedy operator at the end, so the final regex looks like this:
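The answer's final pattern is not reproduced in this excerpt; the small, hypothetical Python example below (the sample text and the keyword Truck are stand-ins, not the asker's real data) only illustrates the greedy vs. non-greedy behaviour described above:

    import re

    text = "START\nfoo\nTruck\nbar\nTruck\nbaz"

    greedy = re.search(r"START[\s\S]*Truck", text)
    lazy = re.search(r"START[\s\S]*?Truck", text)

    print(repr(greedy.group()))  # 'START\nfoo\nTruck\nbar\nTruck' - runs to the LAST Truck
    print(repr(lazy.group()))    # 'START\nfoo\nTruck'             - stops at the FIRST Truck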
QUESTION
I want to extract the text content of this PDF: https://www.welivesecurity.com/wp-content/uploads/2019/07/ESET_Okrum_and_Ketrican.pdf
Here is my code:
...ANSWER
Answered 2020-Jul-19 at 10:17 I don't think this is fixable, because the tool does nothing wrong. After investigation, it turns out the PDF writes out a real period; the instruction used is:
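The instruction itself is not reproduced in this excerpt. To inspect a page's content stream and see the text-showing operators for yourself, here is a rough sketch; using pikepdf is my own assumption (any library that can parse PDF content streams would do), and the filename is the one from the question:

    import pikepdf

    pdf = pikepdf.open("ESET_Okrum_and_Ketrican.pdf")
    # Walk the first page's content stream and print the text-showing
    # instructions (Tj / TJ / ' / ") together with their operands.
    for instruction in pikepdf.parse_content_stream(pdf.pages[0]):
        if str(instruction.operator) in ("Tj", "TJ", "'", '"'):
            print(instruction.operator, instruction.operands)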
QUESTION
I'm trying to extract text information from a (digital) PDF by identifying the content and location of each character and each word. For words, pdftotext --bbox from xpdf / poppler works quite well, but I cannot find an easy way to extract character locations.
What I've tried
The solution I currently have is to convert the PDF to SVG (via pdf2svg), and then parse the resulting SVG to extract single-character (= glyph) locations. In a third step, the resulting boxes are compared, each character is assigned to a word, and hopefully the numbers match.
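A rough sketch of that pdf2svg step (assuming pdf2svg is installed and input.pdf is a placeholder; the exact structure of the generated SVG depends on the cairo version, so the element lookup is only illustrative):

    import subprocess
    import xml.etree.ElementTree as ET

    # Convert page 1 of the PDF to SVG, then list the placed glyphs.
    subprocess.run(["pdf2svg", "input.pdf", "page1.svg", "1"], check=True)

    SVG = "{http://www.w3.org/2000/svg}"
    XLINK = "{http://www.w3.org/1999/xlink}"
    tree = ET.parse("page1.svg")
    # cairo-generated SVGs typically place each glyph with a <use> element
    # carrying x/y coordinates and a reference to the glyph definition.
    for use in tree.getroot().iter(SVG + "use"):
        print(use.get("x"), use.get("y"), use.get(XLINK + "href"))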
Problems
While the above works for most "basic" fonts, there are two (main) situations where this approach fails:
- In script fonts (or some extreme italic fonts), bounding boxes are way larger than their content; as a result, words overlap significantly, and it can well happen that a character is entirely contained in two words. In this case, the mapping fails, because once I translate to svg I have no information on what character is contained in which glyph.
- In many fonts multiple characters can be ligated, giving rise to a single glyph. In this case, the count of character boxes does not match the number of characters in the word, and matching each letter to a box is again problematic.
The second point (which is the main one for me) has a partial workaround by identifying the common ligatures and (if the counts don't match) splitting the corresponding bounding boxes into multiple pieces; but that cannot always work, because for example "ffi" is sometimes ligated to a single glyph, sometimes in two glyphs "ff" + "i", and sometimes in two glyphs "f" + "fi", depending on the font.
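As a sketch of that partial workaround (the ligature table and the even split are simplifying assumptions; real glyph widths are not uniform):

    # Map a few common ligature glyphs to the characters they represent.
    LIGATURES = {"\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
                 "\ufb03": "ffi", "\ufb04": "ffl"}

    def split_ligature_box(box, n_chars):
        """Split one glyph bounding box (x0, y0, x1, y1) into n_chars
        equal-width boxes, one per underlying character."""
        x0, y0, x1, y1 = box
        step = (x1 - x0) / n_chars
        return [(x0 + i * step, y0, x0 + (i + 1) * step, y1)
                for i in range(n_chars)]

    # e.g. an "ffi" glyph spanning x = 10..40 becomes three 10-unit boxes
    print(split_ligature_box((10.0, 0.0, 40.0, 12.0), len(LIGATURES["\ufb03"])))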
What I would hope
It is my understanding that PDFs actually contain glyph information, and not words. If so, all the programs that extract text from PDFs (like pdftotext) must first extract and locate the various characters, and then maybe group them into words/lines; so I am a bit surprised that I could not find options to output the location of each single character. Converting to SVG essentially gives me that, but in that conversion all information about the content (i.e. the mapping glyph-to-character, or glyph-to-characters, if there was a ligature) is lost, because there is no font anymore. And redoing the effort of matching each glyph to a character by looking at the font again feels like rewriting a PDF parser...
I would therefore be very grateful for any idea of how to solve this. The top answer here suggests that this might be doable with TET, but it's a paid option, and replacing my whole infrastructure to handle just one edge case seems like overkill...
...ANSWER
Answered 2019-May-17 at 13:29 A PDF file doesn't necessarily specify the position of each character explicitly. Typically, it breaks a text into runs of characters (all using the same font, anything up to a line, I think) and then for each run, specifies the position of the bounding box that should contain the glyphs for those characters. So the exact position of each glyph will depend on metrics (mostly glyph-widths) of the font used to render it.
The Python package pdfminer has a script pdf2txt.py. Try invoking it with -t xml. The docs just say "XML format. Provides the most information."
But my notes indicate that it will apply the font metrics and give you an element for every single glyph, with font and bounding-box info.
There are various versions in various places (e.g. PyPI and GitHub). If you need Python 3 support, look for pdfminer.six.
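A minimal sketch of the same idea through pdfminer.six's Python API (rather than the pdf2txt.py command line), which reports an LTChar object with a bounding box for every glyph; document.pdf is a placeholder filename:

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages("document.pdf"):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        for obj in line:
                            if isinstance(obj, LTChar):
                                # bbox is (x0, y0, x1, y1) in PDF user-space units
                                print(obj.get_text(), obj.fontname, obj.bbox)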
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install pdf-text-extract
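Assuming the standard npm workflow, npm install pdf-text-extract fetches the module; because it only wraps the pdftotext command (see the summary above), the pdftotext binary from poppler-utils or Xpdf must also be installed and available on the PATH.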