
pdf-text-extract | Extract text from pdfs that contain searchable pdf text | Document Editor library

 by   nisaacson JavaScript Version: Current License: BSD-3-Clause

kandi X-RAY | pdf-text-extract Summary

pdf-text-extract is a JavaScript library typically used in Editor, Document Editor applications. pdf-text-extract has no bugs, it has no vulnerabilities, it has a Permissive License and it has low support. You can install using 'npm i pdf-text-extract' or download it from GitHub, npm.
Extract text from PDFs that contain searchable text. The module is a wrapper that calls the pdftotext command to perform the actual extraction.

Support

  • pdf-text-extract has a low active ecosystem.
  • It has 102 star(s) with 28 fork(s). There are 4 watchers for this library.
  • It had no major release in the last 12 months.
  • There are 10 open issues and 9 have been closed. On average issues are closed in 282 days. There are no pull requests.
  • It has a neutral sentiment in the developer community.
  • The latest version of pdf-text-extract is current.

Quality

  • pdf-text-extract has 0 bugs and 0 code smells.

Security

  • pdf-text-extract has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
  • pdf-text-extract code analysis shows 0 unresolved vulnerabilities.
  • There are 0 security hotspots that need review.

License

  • pdf-text-extract is licensed under the BSD-3-Clause License. This license is Permissive.
  • Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

  • pdf-text-extract releases are not available. You will need to build from source code and install.
  • Deployable package is available in npm.
  • Installation instructions, examples and code snippets are available.
Top functions reviewed by kandi - BETA

kandi has reviewed pdf-text-extract and discovered the below as its top functions. This is intended to give you an instant insight into pdf-text-extract implemented functionality, and help decide if they suit your requirements.

  • Implements the contents of PDFExtract.

pdf-text-extract Key Features

Extract text from pdfs that contain searchable pdf text

pdf-text-extract Examples and Code Snippets

Installation

npm install --save pdf-text-extract

Community Discussions

Trending Discussions on pdf-text-extract
  • Python Regex get Unique Multiline Matches
  • PDMiner missing periods
  • Parse PDF file and output single character locations

QUESTION

Python Regex get Unique Multiline Matches

Asked 2021-Mar-07 at 13:58

Since the background would be far too complicated to explain, I am writing pseudocode. I am only interested in the Python regex pattern, and I hope one of you can help me.

I have the following input text (lots of lines with \n as line separator, condensed to '.'):

.
.
1 Order 
order1 stuff
order1 stuff
etc
ShippingMethod: Truck
.
.
2 Order
order2 stuff
order2 stuff
etc
ShippingMethod: Truck
.
.
Order Summary
.
.

I only want to match the text between 'Order' and 'Truck' for each order individually. I would then iterate over the results further along in the program.

My regex (I am splitting it into "start, content, end" for better readability):

pattern = \d\s*Order + [.|\s|\S]* + Truck

When I execute this match, I get one result, beginning at 1 Order and stopping at the second Truck:

1 Order 
order1 stuff
order1 stuff
etc
ShippingMethod: Truck
.
.
2 Order
order2 stuff
order2 stuff
etc
ShippingMethod: Truck

I want (in this case) exactly two matches which only include one order's contents:

1 Order 
order1 stuff
order1 stuff
etc
ShippingMethod: Truck
2 Order
order2 stuff
order2 stuff
etc
ShippingMethod: Truck

I hope it's clear what I am looking for. Any help is greatly appreciated.
Thanks in advance, stay safe, stay healthy!

Things you might suggest:

  • You have to assume varying amounts of whitespace at the start of a line and in between words, since the input text is the result of a PDF text extractor. But \n can be trusted. Basically, instead of writing \n, write \s*\n.
  • I can't use "Order" as the end-part of the pattern, since after the last order the next thing is a summary.
  • "ShippingMethod" is different in my language, that's why I used "Truck" for this example here. I will manage to rewrite.

ANSWER

Answered 2021-Mar-07 at 13:35

The solution is deceptively simple - use the non-greedy operator ?.

To begin with, a character class [] matches ANY single character listed in it, so to match a or b the regex is [ab] and not [a|b] (inside a class, | is a literal pipe). So the content part of your pattern should be [.\s\S].
Also, \s and \S match whitespace and non-whitespace characters respectively, so together they already cover every character and the period (.) is redundant here.

So the final content part should look like this: [\s\S]*
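The two points about character classes can be checked with a minimal sketch like the one below. The sample strings are made up for illustration and are not taken from the question.

```python
import re

# Inside a character class, "|" is a literal pipe, not alternation,
# so [a|b] matches "a", "|" and "b".
pipe_in_class = re.findall(r"[a|b]", "a|b")

# Without flags, "." stops at a newline, while [\s\S] matches across it.
dot_match = re.match(r".*", "line1\nline2").group(0)
class_match = re.match(r"[\s\S]*", "line1\nline2").group(0)

print(pipe_in_class)
print(repr(dot_match))
print(repr(class_match))
```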

Now for the actual solution:

The non-greedy ? operator, placed after any quantifier such as +, * or ?, tells the regex to match as few of the preceding elements as possible. With a plain *, you're using the default greedy version of zero-or-more, which matches as many characters as possible, so the match runs past the first Truck (where you want it to stop) all the way to the last one.

So we add a non-greedy operator at the end, so the final regex looks like this:

\d\s*Order[\s\S]*?Truck
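As a quick check, the greedy and non-greedy variants can be compared on a small input modelled on the question. The text below is an assumed stand-in for the real PDF output, with the '.' filler lines replaced by ordinary words.

```python
import re

# Assumed sample text shaped like the input described in the question.
input_text = """\
header
1 Order 
order1 stuff
  etc
ShippingMethod: Truck
filler
2 Order
order2 stuff
etc
ShippingMethod: Truck
Order Summary
"""

# Greedy: [\s\S]* grabs as much as possible, then backtracks to the LAST Truck.
greedy = re.findall(r"\d\s*Order[\s\S]*Truck", input_text)

# Non-greedy: [\s\S]*? stops at the FIRST Truck after each "<digit> Order".
lazy = re.findall(r"\d\s*Order[\s\S]*?Truck", input_text)

print(len(greedy))  # a single match spanning both orders
print(len(lazy))    # one match per order
```

Note that "Order Summary" is not matched by either pattern because no digit precedes it.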

Bonus Advice:

The character class [\s\S] is a neat way to tell the regex to match EVERY CHARACTER (because every character is either a space or not a space). But it turns out there's a way to improve efficiency by using the re.DOTALL modifier. It does what it says - it tells the regex that . (the DOT) should match ALL characters, including newlines.

If this is the code you were using:

re.findall(r'\d\s*Order[\s\S]*?Truck', input_text)

Here's the equivalent code using re.DOTALL (including the non-greedy fix from above):

re.findall(r'\d\s*Order.*?Truck', input_text, re.DOTALL)

As you can see, the .*? will now match everything (including newlines) from Order to Truck.
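Since the question mentions iterating over the results afterwards, here is a small sketch of that step. It assumes a made-up input string, checks that the [\s\S] and re.DOTALL variants return the same matches, and strips the stray leading whitespace that PDF extraction tends to leave on each line; the variable names are illustrative only.

```python
import re

# Assumed sample input; the leading spaces mimic PDF extraction noise.
input_text = (
    "1 Order\n a\nShippingMethod: Truck\n.\n"
    "2 Order\n b\nShippingMethod: Truck\nOrder Summary\n"
)

class_pattern = re.compile(r"\d\s*Order[\s\S]*?Truck")
dotall_pattern = re.compile(r"\d\s*Order.*?Truck", re.DOTALL)

class_matches = class_pattern.findall(input_text)
dotall_matches = dotall_pattern.findall(input_text)

# Iterate over each order and normalise whitespace line by line.
orders = []
for match in dotall_pattern.finditer(input_text):
    lines = [line.strip() for line in match.group(0).splitlines()]
    orders.append(lines)

for order in orders:
    print(order)
```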

Source https://stackoverflow.com/questions/66516866

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install pdf-text-extract

You will need the pdftotext binary available on your path. There are packages available for many different operating systems. See https://github.com/nisaacson/pdf-extract#osx for how to install the pdftotext command.
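To illustrate what wrapping pdftotext involves, here is a rough Python sketch. It is not the library's code or API: the helper names are hypothetical, and it only assumes the standard pdftotext behaviour of writing to stdout when the output file is "-" and ending each page with a form feed character.

```python
import shutil
import subprocess
from typing import List


def pdftotext_available() -> bool:
    """Return True if a pdftotext executable is found on PATH."""
    return shutil.which("pdftotext") is not None


def split_pages(raw_text: str) -> List[str]:
    """Split pdftotext output into pages (each page ends with a form feed)."""
    pages = raw_text.split("\f")
    # A trailing form feed leaves one empty chunk at the end; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def extract_pages(pdf_path: str) -> List[str]:
    """Hypothetical helper: run pdftotext and return one string per page."""
    if not pdftotext_available():
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return split_pages(result.stdout.decode("utf-8"))
```

Checking for the binary up front gives a clearer error than letting the subprocess call fail when pdftotext is missing.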

Support

For new features, suggestions and bugs, create an issue on GitHub. If you have questions, search for existing answers or ask on the Stack Overflow community page.
