kandi X-RAY | pdf-text-extract Summary
Extract text from PDFs that contain searchable text. The module is a wrapper that calls the pdftotext command to perform the actual extraction.
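A minimal sketch of the wrapper approach (illustrative only, not the module's actual code), assuming the pdftotext CLI from xpdf/poppler is available on PATH:

```python
import subprocess

def pdftotext_cmd(pdf_path):
    # "-" as the output file tells pdftotext to write to stdout
    # instead of creating a .txt file next to the PDF
    return ["pdftotext", pdf_path, "-"]

def extract_text(pdf_path):
    """Shell out to the pdftotext CLI and return the extracted text."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path), capture_output=True, text=True, check=True
    )
    return result.stdout
```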
Top functions reviewed by kandi - BETA
- Implements the contents of PDFExtract.
pdf-text-extract Key Features
pdf-text-extract Examples and Code Snippets
Trending Discussions on pdf-text-extract
Since the background would be far too complicated to explain, I am writing pseudocode; I am only interested in the Python regex pattern. I hope one of you can help me.
I have the following input text (lots of lines with
\n as line separator, condensed to '.'):
Answered 2021-Mar-07 at 13:35
The solution is deceptively simple: use the non-greedy operator ?.
To begin with, a character class regex like [ab] matches ANY character in it, so to match a or b the regex is [ab] and not [a|b]. So the content part of your code should be:
\s and \S match all spaces and non-spaces respectively, so the period (.) is irrelevant here.
So the final content part should look like this:
The ? operator after any normal frequency operator like * tells the regex to match as few of the element/s as possible. With *, you're using the default greedy version of zero-or-more, telling the regex to match as many as possible (which ends up matching even the first Truck you want!)
So we add a non-greedy operator at the end, so the final regex looks like this:
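The thread's actual input text and pattern are elided above, so here is a self-contained illustration of the greedy-versus-non-greedy difference; the sample string and pattern are made up for demonstration:

```python
import re

text = "Truck A ... Truck B ... END"

# Greedy: .* grabs as much as possible, so the match runs to the LAST "..."
greedy = re.search(r"Truck.*\.\.\.", text)
# Non-greedy: .*? stops at the FIRST "..." that lets the pattern succeed
lazy = re.search(r"Truck.*?\.\.\.", text)

print(greedy.group())  # Truck A ... Truck B ...
print(lazy.group())    # Truck A ...
```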
I want to extract the text content of this PDF: https://www.welivesecurity.com/wp-content/uploads/2019/07/ESET_Okrum_and_Ketrican.pdf
Here is my code:...
Answered 2020-Jul-19 at 10:17
I don't think this is fixable, because the tool does nothing wrong. After investigation, the PDF writes out a real period; the instruction used is:
I'm trying to extract text information from a (digital) PDF by identifying content and location of each character and each word. For words,
pdftotext -bbox from xpdf / poppler works quite well, but I cannot find an easy way to extract character locations.
What I've tried
The solution I currently have is to convert the PDF to SVG (via
pdf2svg), and then parse the resulting SVG to extract single-character (= glyph) locations. In a third step, the resulting boxes are compared and each character is assigned to a word, and hopefully the numbers match.
While the above works for most "basic" fonts, there are two (main) situations where this approach fails:
- In script fonts (or some extreme italic fonts), bounding boxes are way larger than their content; as a result, words overlap significantly, and it can well happen that a character is entirely contained in two words. In this case, the mapping fails, because once I translate to svg I have no information on what character is contained in which glyph.
- In many fonts multiple characters can be ligated, giving rise to a single glyph. In this case, the count of character boxes does not match the number of characters in the word, and matching each letter to a box is again problematic.
The second point (which is the main one for me) has a partial workaround by identifying the common ligatures and (if the counts don't match) splitting the corresponding bounding boxes into multiple pieces; but that cannot always work, because for example "ffi" is sometimes ligated to a single glyph, sometimes in two glyphs "ff" + "i", and sometimes in two glyphs "f" + "fi", depending on the font.
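A sketch of the partial workaround described above: when a glyph box covers a known ligature, split its width evenly among the constituent letters. Both the ligature table and the equal-width split are simplifying assumptions here; real glyph advances are not uniform, which is exactly why the workaround cannot always work.

```python
# Common single-glyph ligatures mapped to their constituent characters
LIGATURES = {"ﬁ": "fi", "ﬂ": "fl", "ﬀ": "ff", "ﬃ": "ffi", "ﬄ": "ffl"}

def split_ligature_box(char, bbox):
    """bbox = (x0, y0, x1, y1); return a list of (char, bbox) pairs,
    one per constituent letter, slicing the box width evenly."""
    parts = LIGATURES.get(char)
    if not parts:
        return [(char, bbox)]  # not a ligature: keep the box as-is
    x0, y0, x1, y1 = bbox
    step = (x1 - x0) / len(parts)
    return [
        (c, (x0 + i * step, y0, x0 + (i + 1) * step, y1))
        for i, c in enumerate(parts)
    ]
```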
What I would hope
It is my understanding that PDFs actually contain glyph information, and not words. If so, all the programs that extract text from PDFs (like
pdftotext) must first extract and locate the various characters, and then maybe group them into words/lines; so I am a bit surprised that I could not find options to output the location of each single character. Converting to SVG essentially gives me that, but in that conversion all information about the content (i.e. the glyph-to-character mapping, or glyph-to-characters if there was a ligature) is lost, because there is no font anymore. And redoing the effort of matching each glyph to a character by looking at the font again feels like rewriting a PDF parser...
I would therefore be very grateful for any idea of how to solve this. The top answer here suggests that this might be doable with TET, but that is a paid option, and replacing my whole infrastructure to handle just one limit case seems like overkill...
Answered 2019-May-17 at 13:29
A PDF file doesn't necessarily specify the position of each character explicitly. Typically, it breaks a text into runs of characters (all using the same font, anything up to a line, I think) and then for each run, specifies the position of the bounding box that should contain the glyphs for those characters. So the exact position of each glyph will depend on metrics (mostly glyph-widths) of the font used to render it.
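A toy illustration of that point (all names and numbers invented): given a run's start position and per-glyph advance widths from the font's metrics, each glyph's x position is derived rather than stored in the PDF.

```python
def glyph_positions(start_x, text, widths):
    """Advance a pen along a run: widths maps each character to its
    assumed advance width in text-space units."""
    positions = []
    x = start_x
    for ch in text:
        positions.append((ch, x))
        x += widths[ch]  # the next glyph starts where this one's advance ends
    return positions

print(glyph_positions(72.0, "fit", {"f": 4.0, "i": 2.5, "t": 3.5}))
# [('f', 72.0), ('i', 76.0), ('t', 78.5)]
```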
The Python package
pdfminer has a script
pdf2txt.py. Try invoking it with
-t xml. The docs just say "XML format. Provides the most information." But my notes indicate that it will apply the font metrics and give you an element for every single glyph, with font and bounding-box info.
There are various versions in various places (e.g. PyPI and GitHub). If you need Python 3 support, look for
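A sketch of consuming that XML output with the standard library. The sample string mimics the per-glyph elements pdf2txt.py -t xml emits; the exact attribute set may differ between pdfminer versions, so verify against your own output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample imitating pdf2txt.py -t xml output: one <text>
# element per glyph, with font, bbox ("x0,y0,x1,y1"), and size attributes.
sample = """<pages><page id="1"><textbox id="0"><textline>
<text font="ABC+Times" bbox="72.0,700.1,79.2,712.1" size="12.0">H</text>
<text font="ABC+Times" bbox="79.2,700.1,85.8,712.1" size="12.0">i</text>
</textline></textbox></page></pages>"""

def glyphs(xml_string):
    """Return (char, font, (x0, y0, x1, y1)) for each glyph element."""
    out = []
    for el in ET.fromstring(xml_string).iter("text"):
        if el.get("bbox"):  # glyph-level elements carry a bbox attribute
            box = tuple(float(v) for v in el.get("bbox").split(","))
            out.append((el.text, el.get("font"), box))
    return out

for ch, font, box in glyphs(sample):
    print(ch, font, box)
```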
No vulnerabilities reported