kandi has reviewed pdf-text-extract and identified the functions below as its top functions. This is intended to give you a quick insight into the functionality pdf-text-extract implements, and to help you decide whether it suits your requirements.
Extract text from PDFs that contain searchable text
npm install --save pdf-text-extract
Python Regex get Unique Multiline Matches
Asked 2021-Mar-07 at 13:58
Since the background would be far too complicated to explain, I am writing pseudocode. I am only interested in the Python regex pattern, and I hope one of you can help me.
I have the following input text (lots of lines, with the \n line separators condensed to '.'):
. . 1 Order order1 stuff order1 stuff etc ShippingMethod: Truck . . 2 Order order2 stuff order2 stuff etc ShippingMethod: Truck . . Order Summary . .
I only want to match the text between 'Order' and 'Truck' for each order individually; I would then iterate over the results further along in the program.
My regex (I am splitting it into "start, content, end" for better readability):
pattern = \d\s*Order + [.|\s|\S]* + Truck
When I execute this match, I get one result, beginning at 1 Order and stopping at the second Truck:
1 Order order1 stuff order1 stuff etc ShippingMethod: Truck . . 2 Order order2 stuff order2 stuff etc ShippingMethod: Truck
I want (in this case) exactly two matches which only include one order's contents:
1 Order order1 stuff order1 stuff etc ShippingMethod: Truck
2 Order order2 stuff order2 stuff etc ShippingMethod: Truck
I hope it's clear what I am looking for. Any help is greatly appreciated.
Thanks in advance, stay safe, stay healthy!
Things you might suggest:
ANSWER
Answered 2021-Mar-07 at 13:35
The solution is deceptively simple: use the non-greedy operator ?.
To begin with, a character class [...] matches ANY single character listed in it, so to match a or b the regex is [ab] and not [a|b]. So the content part of your pattern should be written as [.\s\S] rather than [.|\s|\S].
\s and \S match all whitespace and non-whitespace characters respectively, so the period (.) is irrelevant here.
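To make the character-class point concrete, here is a minimal sketch using made-up sample strings (they are not from the question). It shows that a pipe inside square brackets is matched literally, and that adding . or | to [\s\S] does not change what the class matches.

```python
import re

# Inside a character class, '|' is a literal pipe, not alternation.
with_pipe = re.findall(r'[a|b]', 'a|b')
without_pipe = re.findall(r'[ab]', 'a|b')

print(with_pipe)     # ['a', '|', 'b']
print(without_pipe)  # ['a', 'b']

# [\s\S] already covers every character, so the extra '.' and '|'
# in the class are redundant.
sample = 'x.|\n'
same_matches = re.findall(r'[.|\s\S]', sample) == re.findall(r'[\s\S]', sample)
print(same_matches)  # True
```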
So the final content part should look like this: [\s\S]*
A ? placed after a normal quantifier such as * tells the regex to match as few of the elements as possible. With a bare *, you're using the default greedy version of zero-or-more, telling the regex to match as many as possible (which ends up running straight past the first Truck, the one you want to stop at!)
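The difference between the greedy and non-greedy quantifier can be seen on a small one-line string. This is only an illustrative sketch; the text variable is a shortened stand-in for the real input.

```python
import re

# Shortened, single-line stand-in for the order text.
text = 'Order A Truck Order B Truck'

greedy = re.findall(r'Order.*Truck', text)   # '*' grabs as much as possible
lazy = re.findall(r'Order.*?Truck', text)    # '*?' stops at the first 'Truck'

print(greedy)  # ['Order A Truck Order B Truck']
print(lazy)    # ['Order A Truck', 'Order B Truck']
```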
So we add the non-greedy operator after the *, and the final regex looks like this: \d\s*Order[\s\S]*?Truck
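As a quick check, the pattern built from the start, content and end parts can be run against a reconstruction of the sample from the question. The exact line layout of input_text below is an assumption (the question only shows the newlines condensed to '.'), so treat this as a sketch, not the asker's real data.

```python
import re

# Assumed reconstruction of the sample: real newlines where the
# question shows '.', and a guessed split of each order into lines.
input_text = "\n".join([
    "", "",
    "1 Order",
    "order1 stuff",
    "order1 stuff etc",
    "ShippingMethod: Truck",
    "", "",
    "2 Order",
    "order2 stuff",
    "order2 stuff etc",
    "ShippingMethod: Truck",
    "", "",
    "Order Summary",
    "", "",
])

# start + content + end, with the non-greedy '*?' in the content part
pattern = re.compile(r'\d\s*Order[\s\S]*?Truck')
matches = pattern.findall(input_text)

print(len(matches))  # 2
for match in matches:
    print(repr(match))
```

The trailing "Order Summary" line is not matched because no digit precedes it, which is what the \d at the start of the pattern requires.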
The character class [\s\S] is a neat way to tell the regex to match EVERY CHARACTER (because every character is either a space or not a space). But it turns out there's a way to improve efficiency by using the re.DOTALL modifier. It does what it says - it tells the regex that . (the DOT) should match ALL characters, including newlines.
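The effect of re.DOTALL can be checked with a short multi-line string. This is a minimal sketch with an invented text value: without the flag the dot stops at the newline and nothing matches, while with the flag the result is the same as with [\s\S].

```python
import re

# Invented multi-line sample; only the newlines matter here.
text = "1 Order\nline one\nline two\nTruck"

no_flag = re.findall(r'\d\s*Order.*?Truck', text)
with_flag = re.findall(r'\d\s*Order.*?Truck', text, re.DOTALL)
with_class = re.findall(r'\d\s*Order[\s\S]*?Truck', text)

print(no_flag)                  # []
print(with_flag == with_class)  # True
```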
If this is the code you were using:
Here's the best possible code (including the solution of the question):
re.findall(r'\d\s*Order.*?Truck', input_text, re.DOTALL)
As you can see, the .*? will now match everything (including newlines) from