kandi X-RAY | pdf-text-extract Summary
Extract text from PDFs that contain searchable text. The module is a wrapper that calls the pdftotext command to perform the actual extraction.
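A minimal sketch of the wrapper approach (illustrative only, not the module's actual code), assuming the pdftotext CLI from xpdf/poppler is available on PATH:

```python
import subprocess

def pdftotext_cmd(pdf_path):
    # "-" as the output file tells pdftotext to write to stdout
    # instead of creating a .txt file next to the PDF
    return ["pdftotext", pdf_path, "-"]

def extract_text(pdf_path):
    """Shell out to the pdftotext CLI and return the extracted text."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path), capture_output=True, text=True, check=True
    )
    return result.stdout
```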
Top functions reviewed by kandi - BETA
- Implements the contents of PDFExtract.
pdf-text-extract Key Features
pdf-text-extract Examples and Code Snippets
Trending Discussions on pdf-text-extract
Since the background would be far too complicated to explain, I am writing pseudocode; I am only interested in the Python regex pattern. I hope one of you can help me.
I have the following input text (lots of lines with
\n as line separator, condensed to '.'):
Answered 2021-Mar-07 at 13:35
The solution is deceptively simple: use the non-greedy operator ?.
To begin with, a character class regex like [ab] matches ANY character in it, so to match a or b the regex is [ab] and not [a|b]. So the content part of your code should be:
\s and \S match all spaces and non-spaces respectively, so the period (.) is irrelevant here.
So the final content part should look like this:
The ? operator after any normal frequency operator like * tells the regex to match as few of the element/s as possible. With *, you're using the default greedy version of zero-or-more, telling the regex to match as many as possible (which ends up matching even the first Truck you want!)
So we add a non-greedy operator at the end, so the final regex looks like this:
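The thread's actual input text and pattern are elided above, so here is a self-contained illustration of the greedy-versus-non-greedy difference; the sample string and pattern are made up for demonstration:

```python
import re

text = "Truck A ... Truck B ... END"

# Greedy: .* grabs as much as possible, so the match runs to the LAST "..."
greedy = re.search(r"Truck.*\.\.\.", text)
# Non-greedy: .*? stops at the FIRST "..." that lets the pattern succeed
lazy = re.search(r"Truck.*?\.\.\.", text)

print(greedy.group())  # Truck A ... Truck B ...
print(lazy.group())    # Truck A ...
```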
I want to extract the text content of this PDF: https://www.welivesecurity.com/wp-content/uploads/2019/07/ESET_Okrum_and_Ketrican.pdf
Here is my code:...
Answered 2020-Jul-19 at 10:17
I don't think this is fixable, because the tool does nothing wrong. After investigation, the PDF writes out a real period; the instruction used is:
I'm trying to extract text information from a (digital) PDF by identifying content and location of each character and each word. For words,
pdftotext -bbox from xpdf / poppler works quite well, but I cannot find an easy way to extract character locations.
What I've tried
The solution I currently have is to convert the PDF to SVG (via
pdf2svg), and then parse the resulting SVG to extract single-character (= glyph) locations. In a third step, the resulting boxes are compared and each character is assigned to a word, and hopefully the numbers match.
While the above works for most "basic" fonts, there are two (main) situations where this approach fails:
- In script fonts (or some extreme italic fonts), bounding boxes are way larger than their content; as a result, words overlap significantly, and it can well happen that a character is entirely contained in two words. In this case, the mapping fails, because once I translate to svg I have no information on what character is contained in which glyph.
- In many fonts multiple characters can be ligated, giving rise to a single glyph. In this case, the count of character boxes does not match the number of characters in the word, and matching each letter to a box is again problematic.
The second point (which is the main one for me) has a partial workaround by identifying the common ligatures and (if the counts don't match) splitting the corresponding bounding boxes into multiple pieces; but that cannot always work, because for example "ffi" is sometimes ligated to a single glyph, sometimes in two glyphs "ff" + "i", and sometimes in two glyphs "f" + "fi", depending on the font.
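A sketch of the partial workaround described above: when a glyph box covers a known ligature, split its width evenly among the constituent letters. Both the ligature table and the equal-width split are simplifying assumptions here; real glyph advances are not uniform, which is exactly why the workaround cannot always work.

```python
# Common single-glyph ligatures mapped to their constituent characters
LIGATURES = {"ﬁ": "fi", "ﬂ": "fl", "ﬀ": "ff", "ﬃ": "ffi", "ﬄ": "ffl"}

def split_ligature_box(char, bbox):
    """bbox = (x0, y0, x1, y1); return a list of (char, bbox) pairs,
    one per constituent letter, slicing the box width evenly."""
    parts = LIGATURES.get(char)
    if not parts:
        return [(char, bbox)]  # not a ligature: keep the box as-is
    x0, y0, x1, y1 = bbox
    step = (x1 - x0) / len(parts)
    return [
        (c, (x0 + i * step, y0, x0 + (i + 1) * step, y1))
        for i, c in enumerate(parts)
    ]
```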
What I would hope
It is my understanding that PDFs actually contain glyph information, and not words. If so, all the programs that extract text from PDFs (like
pdftotext) must first extract and locate the various characters, and then maybe group them into words/lines; so I am a bit surprised that I could not find options to output the location of each single character. Converting to SVG essentially gives me that, but in that conversion all information about the content (i.e. the glyph-to-character mapping, or glyph-to-characters if there was a ligature) is lost, because there is no font anymore. And redoing the effort of matching each glyph to a character by looking at the font again feels like rewriting a PDF parser...
I would therefore be very grateful for any idea of how to solve this. The top answer here suggests that this might be doable with TET, but that is a paid option, and replacing my whole infrastructure to handle just one limit case seems like overkill...
Answered 2019-May-17 at 13:29
A PDF file doesn't necessarily specify the position of each character explicitly. Typically, it breaks a text into runs of characters (all using the same font, anything up to a line, I think) and then for each run, specifies the position of the bounding box that should contain the glyphs for those characters. So the exact position of each glyph will depend on metrics (mostly glyph-widths) of the font used to render it.
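A toy illustration of that point (all names and numbers invented): given a run's start position and per-glyph advance widths from the font's metrics, each glyph's x position is derived rather than stored in the PDF.

```python
def glyph_positions(start_x, text, widths):
    """Advance a pen along a run: widths maps each character to its
    assumed advance width in text-space units."""
    positions = []
    x = start_x
    for ch in text:
        positions.append((ch, x))
        x += widths[ch]  # the next glyph starts where this one's advance ends
    return positions

print(glyph_positions(72.0, "fit", {"f": 4.0, "i": 2.5, "t": 3.5}))
# [('f', 72.0), ('i', 76.0), ('t', 78.5)]
```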
The Python package
pdfminer has a script
pdf2txt.py. Try invoking it with
-t xml. The docs just say "XML format. Provides the most information." But my notes indicate that it will apply the font metrics and give you an element for every single glyph, with font and bounding-box info.
There are various versions in various places (e.g. PyPI and GitHub). If you need Python 3 support, look for
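A sketch of consuming that XML output with the standard library. The sample string mimics the per-glyph elements pdf2txt.py -t xml emits; the exact attribute set may differ between pdfminer versions, so verify against your own output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample imitating pdf2txt.py -t xml output: one <text>
# element per glyph, with font, bbox ("x0,y0,x1,y1"), and size attributes.
sample = """<pages><page id="1"><textbox id="0"><textline>
<text font="ABC+Times" bbox="72.0,700.1,79.2,712.1" size="12.0">H</text>
<text font="ABC+Times" bbox="79.2,700.1,85.8,712.1" size="12.0">i</text>
</textline></textbox></page></pages>"""

def glyphs(xml_string):
    """Return (char, font, (x0, y0, x1, y1)) for each glyph element."""
    out = []
    for el in ET.fromstring(xml_string).iter("text"):
        if el.get("bbox"):  # glyph-level elements carry a bbox attribute
            box = tuple(float(v) for v in el.get("bbox").split(","))
            out.append((el.text, el.get("font"), box))
    return out

for ch, font, box in glyphs(sample):
    print(ch, font, box)
```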
No vulnerabilities reported