ocr | ocr.sh : a bash script to OCR PDF files | Computer Vision library

by vrasneur Shell Version: Current License: Unlicense

kandi X-RAY | ocr Summary

ocr is a Shell library typically used in Artificial Intelligence, Computer Vision applications. ocr has no bugs, it has no vulnerabilities, it has a Permissive License and it has low support. You can download it from GitHub.

ocr.sh: a bash script to OCR PDF files easily

Support

ocr has a low active ecosystem.

It has 13 star(s) with 3 fork(s). There are 2 watchers for this library.

It had no major release in the last 6 months.

ocr has no issues reported. There are no pull requests.

It has a neutral sentiment in the developer community.

The latest version of ocr is current.

Quality

ocr has no bugs reported.

Security

ocr has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

License

ocr is licensed under the Unlicense License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

ocr releases are not available. You will need to build from source code and install.

Installation instructions are not available. Examples and code snippets are available.

ocr Key Features

No Key Features are available at this moment for ocr.

ocr Examples and Code Snippets

No Code Snippets are available at this moment for ocr.

Community Discussions

Trending Discussions on ocr

General approach to parsing text with special characters from PDF using Tesseract?

what is the best regular expression to replace non numeric character in a string preceded by certain phrase in python?

Using Google ML-Kit On-Device Text Recognition in Flutter

Python: Speed of loop drastically increases if different run order?

Generating similar words for OCR

How can I make my program input images taken from a Camera?

How to improve Hindi text extraction?

How to remove text from the sketched image

Batch OCR files in subfolders and save new files with new name

text recognition and restructuring OCR opencv

QUESTION

General approach to parsing text with special characters from PDF using Tesseract?

Asked 2021-Jun-15 at 20:17

I would like to extract the definitions from the book The Navajo Language: A Grammar and Colloquial Dictionary by Young and Morgan. They look like this (very blurry):

I tried running it through the Google Cloud Vision API, and got decent results, but it doesn't know what to do with these "special" letters with accent marks on them, or the curls and lines on/through them. And because of the blurryness (there are no alternative sources of the PDF), it gets a lot of them wrong. So I'm thinking of doing it from scratch in Tesseract. Note the term is bold and the definition is not bold.

How can I use Node.js and Tesseract to get basically an array of JSON objects sort of like this:

...

ANSWER

Answered 2021-Jun-15 at 20:17

Tesseract takes a lang variable that you can expand to include different languages if they're installed. I've used the UB Mannheim (https://github.com/UB-Mannheim/tesseract/wiki) installation, which ships with support for a large number of languages.

To get better and more accurate results, the best thing to do is to process the image before handing it to Tesseract. Set a white/black threshold so that you have black text on white background with no shading. I'm not sure how to do this in Node, but I've done it with Python's OpenCV library.
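The thresholding idea can be illustrated without any imaging library. What follows is a minimal, dependency-free sketch, not the answerer's code: the function name binarize, the cutoff of 128, and the tiny sample "image" are all assumptions made for illustration. In practice an OpenCV thresholding call would do this work on a real image.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to pure black (0) or pure white (255).

    `gray` is a list of rows, each row a list of integer pixel values.
    The cutoff of 128 is an arbitrary assumption; real scans usually
    need a tuned or automatically chosen threshold.
    """
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


# A tiny 2x4 stand-in for a scan: dark text pixels on a shaded, light background.
page = [
    [200, 190, 40, 210],
    [180, 35, 30, 220],
]
clean = binarize(page)
```

After this step every pixel is either 0 or 255, which is the "black text on white background with no shading" input the answer recommends handing to Tesseract.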

If that font doesn't give you decent results out of the box, then you'll want to train your own model, yes. This blog post walks through the process in great detail: https://towardsdatascience.com/simple-ocr-with-tesseract-a4341e4564b6. It revolves around using jTessBoxEditor to hand-label the objects detected in the images you're using.

Edit: In brief, the process to train your own:

1. Install jTessBoxEditor (https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/). It requires a Java Runtime to be installed as well.
2. Collect your training images. They need to be .tiff files. I found I got fairly accurate results with not a whole lot of images, as long as they had a good sample of all the characters I wanted to detect. Maybe 30 to 40 images. It's tedious, so you don't want to do TOO many, but you need enough to get a good sampling.
3. Use jTessBoxEditor to merge all the images into a single .tiff.
4. Create a training label file (.box). This is done with Tesseract itself: tesseract your_language.font.exp0.tif your_language.font.exp0 makebox
5. Now you can open the box file in jTessBoxEditor and you'll see how and where it detected the characters: the bounding boxes and which character it saw. The tedious part: hand-fix all the bounding boxes and characters to accurately represent what is in the images. Not joking, it's tedious. Slap some TV episodes up and just churn through it.
6. Train the Tesseract model itself:

Save a file named font_properties whose content is: font 0 0 0 0 0
Then run the following commands:

tesseract num.font.exp0.tif font_name.font.exp0 nobatch box.train

unicharset_extractor font_name.font.exp0.box

shapeclustering -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr

mftraining -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr

cntraining font_name.font.exp0.tr

Close to the end of the output, you should see something that looks like this:

Master shape_table:Number of shapes = 10 max unichars = 1 number with multiple unichars = 0

That number of shapes should roughly be the number of characters present in all the image files you've provided.

If it went well, you should have 4 files created: inttemp normproto pffmtable shapetable. Rename them all with the prefix of your_language from before. So e.g. your_language.inttemp etc.
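The renaming step can be scripted. This is a minimal sketch, assuming the four output files sit in one directory; the helper name prefix_training_files is hypothetical and not part of Tesseract.

```python
from pathlib import Path

# The four files the answer says the training commands produce.
TRAINING_OUTPUTS = ("inttemp", "normproto", "pffmtable", "shapetable")


def prefix_training_files(directory, lang):
    """Rename each training output to '<lang>.<name>' inside `directory`.

    Hypothetical helper for illustration. It checks that all four files
    exist before renaming any, so a failed training run is not left
    half-renamed.
    """
    directory = Path(directory)
    missing = [n for n in TRAINING_OUTPUTS if not (directory / n).exists()]
    if missing:
        raise FileNotFoundError(f"missing training outputs: {missing}")
    renamed = []
    for name in TRAINING_OUTPUTS:
        target = directory / f"{lang}.{name}"
        (directory / name).rename(target)
        renamed.append(target)
    return renamed
```

Calling prefix_training_files(".", "your_language") in the training directory would produce your_language.inttemp and the other three prefixed files.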

Then run:

combine_tessdata your_language

The file your_language.traineddata is the model. Copy that into your Tesseract data folder. On Windows, it'll be like: C:\Program Files x86\tesseract\4.0\tessdata and on Linux it's probably something like /usr/share/tesseract/4.0/tessdata.

Then when you run Tesseract, you'll pass the lang=your_language. I found best results when I still passed an existing language as well, so like for my stuff it was still English I was grabbing, just funny fonts. So I still wanted the English as well, so I'd pass: lang=your_language+eng.
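When calling the tesseract command-line tool, the combined language value goes after the -l option. Here is a small sketch of building that command as an argument list; the function name tesseract_command and the file names are assumptions for illustration.

```python
def tesseract_command(image, output_base, languages):
    """Build the argument list for a tesseract CLI call.

    Languages are joined with '+', so ["your_language", "eng"] becomes
    "your_language+eng". Hypothetical helper; the resulting list can be
    passed to subprocess.run.
    """
    return ["tesseract", str(image), str(output_base), "-l", "+".join(languages)]


cmd = tesseract_command("page.tif", "page_out", ["your_language", "eng"])
```

Passing the arguments as a list rather than one string avoids shell quoting problems with file names that contain spaces.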

Source https://stackoverflow.com/questions/67991718

QUESTION

what is the best regular expression to replace non numeric character in a string preceded by certain phrase in python?

Asked 2021-Jun-15 at 20:02

I have to parse lists of names, addresses, etc. that were OCRed and have invalid or incorrect characters in them. For the state and postal code, I need to recognize the pattern of a 2-character state followed by a 5-digit postal code, and remove any non-numeric characters from the postal code. For example, I might have OK 7-41.03 at the end of a string, and I need to remove the hyphen and period. I know that re.sub('[^0-9]+', '', '7-41.03') will remove the desired characters, but I need it to replace characters only in numbers found at the end of the string, and only if they are preceded by a two-character state wrapped in spaces, like OK. If I add anything such as a lookbehind to the regular expression, I can't seem to get the characters replaced. I've come up with the following, but I think there must be a simpler expression to accomplish this. Example:

...

ANSWER

Answered 2021-Jun-15 at 20:02

You need to make use of re.sub callbacks:
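The answer's own code is not reproduced on this page. What follows is a minimal sketch of the callback approach, assuming the state is two uppercase letters surrounded by whitespace and the postal code is the final run of non-space characters. The pattern and the function name clean_postal_code are illustrative, not the answerer's.

```python
import re

# A two-letter uppercase state wrapped in whitespace, then the trailing token.
STATE_ZIP = re.compile(r"(\s[A-Z]{2}\s)(\S+)$")


def clean_postal_code(text):
    """Strip non-digits from the token that follows a state code at the end."""

    def fix(match):
        state, code = match.group(1), match.group(2)
        # Only the postal-code part is cleaned; the state is kept verbatim.
        return state + re.sub(r"\D", "", code)

    return STATE_ZIP.sub(fix, text)


cleaned = clean_postal_code("Tulsa OK 7-41.03")
```

Because the replacement is a function, the digit cleanup applies only to the captured postal-code group, which sidesteps the lookbehind problem described in the question.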

Source https://stackoverflow.com/questions/67990895

QUESTION

Using Google ML-Kit On-Device Text Recognition in Flutter

Asked 2021-Jun-15 at 08:04

Is it possible to use Google ML-Kit On-Device Text Recognition in Flutter? All of the tutorials and resources I am finding online are all firebase_ml_vision, but I am looking for one that uses the no-cost OCR from Google ML-Kit. How would I do this in Flutter?

EDIT: SOLVED - when I posted this the package was not there, but now it is.

...

ANSWER

Answered 2021-Jun-01 at 21:28

Yes, you can use this package: https://pub.dev/packages/mlkit. It is Google's ML Kit, and its OCR supports both iOS and Android. Happy coding ;)

Source https://stackoverflow.com/questions/66084486

QUESTION

Python: Speed of loop drastically increases if different run order?

Asked 2021-Jun-13 at 23:19

As I'm working on a script to correct formatting errors from documents produced by OCR, I ran into an issue where, depending on which loop I run first, my program runs about 80% slower.

Here is a simplified version of my code. I have the following loop to check for uppercase errors (e.g., "posSible"):

...

ANSWER

Answered 2021-Jun-13 at 23:19

headingsFix strips out all the line endings, which you presumably did not intend. However, your question is about why changing the order of transformations results in slower execution, so I'll not discuss fixing that here.

fixUppercase is extremely inefficient at handling lines with many words. It repeatedly calls line.split() over and over again on the entire book-length string. That isn't terribly slow if each line has maybe a dozen words, but it gets extremely slow if you have one enormous line with tens of thousands of words. I found your program runs vastly faster with this change to only split each line once. (I note that I can't say whether your program is correct as it stands, just that this change should have the same behaviour while being a lot faster. I'm afraid I don't particularly understand why it's comparing each word to see if it's the same as the last word on the line.)
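The asker's code is not shown here, so this is only a sketch of the "split each line once" idea under assumed behaviour: a hypothetical fix_uppercase that lowercases stray capitals inside a word (such as "posSible") and leaves acronyms alone.

```python
def fix_word(word):
    """Lowercase stray capitals after the first character of a mixed-case word."""
    rest = word[1:]
    # Leave words alone when the tail is empty, all lowercase, or all
    # uppercase (so an acronym like "OCR" is untouched).
    if not rest or rest.islower() or rest.isupper():
        return word
    return word[0] + rest.lower()


def fix_uppercase(text):
    """Apply fix_word to every word, splitting each line exactly once."""
    fixed_lines = []
    for line in text.split("\n"):
        words = line.split(" ")  # split once, then reuse the list
        fixed_lines.append(" ".join(fix_word(w) for w in words))
    return "\n".join(fixed_lines)


result = fix_uppercase("It is posSible.\nOCR works")
```

The work per line is proportional to its length, so one enormous line with tens of thousands of words costs about the same as the same words spread over many lines.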

Source https://stackoverflow.com/questions/67953901

QUESTION

Generating similar words for OCR

Asked 2021-Jun-13 at 22:18

First off, this is my first time asking a question here, so forgive me if I make any mistakes. My problem is as follows: I'm using Python to sort through a bunch of images. The images are sorted by many criteria, one of which is the text inside the image. I've got OCR working and have a list of "bad" words which aren't supposed to be in the image. The problem is that the OCR often confuses some letters, for example e and a. The question is whether there is an easy way to generate similar-looking words, like create_similar("test"), whose output would be ["test", "tast", "lest"] and so on. Then I could use that as the list of bad words and avoid false negatives. If I'm just missing a really obvious solution, please tell me. I've been trying for hours now and just can't get it to work.

...

ANSWER

Answered 2021-Jun-13 at 22:18

I really recommend this article by Peter Norvig on how to build a spelling corrector. In it, you will find the following function that returns a set of all the edited strings (whether words or not) that can be made with one simple edit. A simple edit to a word is a deletion (remove one letter), a transposition (swap two adjacent letters), a replacement (change one letter to another) or an insertion (add a letter).
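The function described is usually written along the following lines. This is a sketch of the one-edit generator from that article; the letters parameter is an addition here so the alphabet can be narrowed.

```python
import string


def edits1(word, letters=string.ascii_lowercase):
    """Return the set of all strings one simple edit away from `word`."""
    # Every way of cutting the word into a left and right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


candidates = edits1("test")
```

For the OCR use case in the question, the set includes "tast" and "lest". Restricting letters to characters the OCR engine tends to confuse would keep the candidate list much smaller.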

Source https://stackoverflow.com/questions/67962929

QUESTION

How can I make my program input images taken from a Camera?

Asked 2021-Jun-13 at 13:31

I am working on a Python program that reads license plates from trucks. An image gets processed by this program, which filters out the characters as output. Here is the input of the image in the program:

...

ANSWER

Answered 2021-Jun-13 at 13:31

You can capture a single frame by using the VideoCapture method of OpenCV.

Source https://stackoverflow.com/questions/67933828

QUESTION

How to improve Hindi text extraction?

Asked 2021-Jun-11 at 20:13

I am trying to extract Hindi text from a PDF. I tried all the methods to extract it from the PDF, but none of them worked. There are explanations of why it doesn't work, but no answers as such. So I decided to convert the PDF to an image and then use pytesseract to extract the text. I have downloaded the Hindi trained data; however, that also gives highly inaccurate text.

That's the actual Hindi text from the PDF (download link):

That's my code so far:

...

ANSWER

Answered 2021-Jun-08 at 14:46

It seems the module pdfplumber does the work:

Source https://stackoverflow.com/questions/67816185

QUESTION

How to remove text from the sketched image

Asked 2021-Jun-10 at 04:07

I have some sketched images that contain text captions. I am trying to remove those captions.

I am using this code:

...

ANSWER

Answered 2021-Jun-09 at 20:15

The cv2 pre-processing is unnecessary here; tesseract is able to find the text on its own. See the example below, commented inline:

Source https://stackoverflow.com/questions/67910691

QUESTION

Batch OCR files in subfolders and save new files with new name

Asked 2021-Jun-09 at 21:32

I have the following code, which OCR's all PDF files in a specific folder (d:\extracttmp2), but it does not rename the files as I would like, or put the new files in the right place.

Currently, all files are within subfolders of 'extracttmp2'.

The OCR runs correctly, but I would like the OCR'ed files to be renamed to: -_ocred.pdf. Naming them in such a manner will produce no file overwrites.

Currently, the code OCR's the files, but it saves the new files to the folder above the folder they are located in. It also saves the filenames as "JAN_ocred.pdf", for example, for a file named "JAN.pdf". The result of saving up one folder leads to some file overwrites, which is unwanted.

Also, it doesn't matter if the OCR'ed files remain in the folder where the un-OCR'ed files are located, or if they're saved up one folder. The desired renaming will eliminate any overwrites.

The software I'm using is PDF24. https://creator.pdf24.org/manual/10/#command-line. However, I think that my problem is not with the OCR software, but my syntax in the batch script.

Can anyone tell me what I am doing wrong?

...

ANSWER

Answered 2021-Jun-09 at 21:32

Is this what you mean? That is, files will be saved in the same location as before, but each name will be prefixed with its parent directory's name, followed by a hyphen/dash.
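The batch script itself is not reproduced here. The naming rule the answer describes can be sketched in Python as follows, assuming Windows-style paths; the helper name ocred_name and the subfolder 2021 in the usage line are made up for illustration.

```python
from pathlib import PureWindowsPath


def ocred_name(pdf_path):
    """Return the output path: same folder, '<parent>-<stem>_ocred.pdf'.

    Hypothetical helper. It only computes the new name and does not run
    any OCR or touch the file system.
    """
    path = PureWindowsPath(pdf_path)
    return path.with_name(f"{path.parent.name}-{path.stem}_ocred.pdf")


target = ocred_name(r"d:\extracttmp2\2021\JAN.pdf")
```

Because the parent folder name is part of the new file name, two JAN.pdf files in different subfolders no longer collide, even if the outputs are later collected into one folder.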

Source https://stackoverflow.com/questions/67910456

QUESTION

text recognition and restructuring OCR opencv

Asked 2021-Jun-08 at 12:14

Link to original image https://ibb.co/0VC6vkX

I am currently working with an OCR Project. I pre-processed the image, and then applied pre-trained EAST model for text detection.

...

ANSWER

Answered 2021-Jun-07 at 07:02

Here's a possible solution that you can try improving on by trying a few things:

by varying Gaussian parameters
by thresholding the blurred image to see if it improves the result

Code:

Source https://stackoverflow.com/questions/67763853

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install ocr

You can download it from GitHub.

Support

For any new features, suggestions, or bugs, create an issue on GitHub. If you have any questions, check and ask them on the Stack Overflow community page.
