ocr | ocr.sh : a bash script to OCR PDF files | Computer Vision library
kandi X-RAY | ocr Summary
kandi X-RAY | ocr Summary
ocr.sh: a bash script to OCR PDF files easily
Support
Quality
Security
License
Reuse
Top functions reviewed by kandi - BETA
Currently covering the most popular Java, JavaScript and Python libraries. See a Sample of ocr
ocr Key Features
ocr Examples and Code Snippets
Community Discussions
Trending Discussions on ocr
QUESTION
I would like to extract the definitions from the book The Navajo Language: A Grammar and Colloquial Dictionary by Young and Morgan. They look like this (very blurry):
I tried running it through the Google Cloud Vision API, and got decent results, but it doesn't know what to do with these "special" letters with accent marks on them, or the curls and lines on/through them. And because of the blurryness (there are no alternative sources of the PDF), it gets a lot of them wrong. So I'm thinking of doing it from scratch in Tesseract. Note the term is bold and the definition is not bold.
How can I use Node.js and Tesseract to get basically an array of JSON objects sort of like this:
...ANSWER
Answered 2021-Jun-15 at 20:17Tesseract takes a lang
variable that you can expand to include different languages if they're installed. I've used the UB Mannheim (https://github.com/UB-Mannheim/tesseract/wiki) installation which includes a ton of languages supported.
To get better and more accurate results, the best thing to do is to process the image before handing it to Tesseract. Set a white/black threshold so that you have black text on white background with no shading. I'm not sure how to do this in Node, but I've done it with Python's OpenCV library.
If that font doesn't get you decent results with the out of the box, then you'll want to train your own, yes. This blog post walks through the process in great detail: https://towardsdatascience.com/simple-ocr-with-tesseract-a4341e4564b6. It revolves around using the jTessBoxEditor to hand-label the objects detected in the images you're using.
Edit: In brief, the process to train your own:
- Install jTessBoxEditor (https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/). Requires Java Runtime installed as well.
- Collect your training images. They want to be .tiffs. I found I got fairly accurate results with not a whole lot of images that had a good sample of all the characters I wanted to detect. Maybe 30/40 images. It's tedious, so you don't want to do TOO many, but need enough in order to get a good sampling.
- Use jTessBoxEditor to merge all the images into a single .tiff
- Create a training label file (.box)j. This is done with Tesseract itself.
tesseract your_language.font.exp0.tif your_language.font.exp0 makebox
- Now you can open the box file in jTessBoxEditor and you'll see how/where it detected the characters. Bounding boxes and what character it saw. The tedious part: Hand fix all the bounding boxes and characters to accurately represent what is in the images. Not joking, it's tedious. Slap some tv episodes up and just churn through it.
- Train the tesseract model itself
- save a file:
font_properties
who's content isfont 0 0 0 0 0
- run the following commands:
tesseract num.font.exp0.tif font_name.font.exp0 nobatch box.train
unicharset_extractor font_name.font.exp0.box
shapeclustering -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr
mftraining -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr
cntraining font_name.font.exp0.tr
You should, in there close to the end see some output that looks like this:
Master shape_table:Number of shapes = 10 max unichars = 1 number with multiple unichars = 0
That number of shapes should roughly be the number of characters present in all the image files you've provided.
If it went well, you should have 4 files created: inttemp
normproto
pffmtable
shapetable
. Rename them all with the prefix of your_language
from before. So e.g. your_language.inttemp
etc.
Then run:
combine_tessdata your_language
The file: your_language.traineddata
is the model. Copy that into your Tesseract's data folder. On Windows, it'll be like: C:\Program Files x86\tesseract\4.0\tessdata
and on Linux it's probably something like /usr/shared/tesseract/4.0/tessdata
.
Then when you run Tesseract, you'll pass the lang=your_language
. I found best results when I still passed an existing language as well, so like for my stuff it was still English I was grabbing, just funny fonts. So I still wanted the English as well, so I'd pass: lang=your_language+eng
.
QUESTION
I have to parse lists of names, addresses, etc. that were OCRed and have invalid/incorrect characters in them and on the state postal code I need to recognize the pattern with a 2 character state followed by a 5 digit postal code and replace any non numeric characters in the postal code. I might have OK 7-41.03
at the end of a string I need to remove the hyphen and period. I know that re.sub('[^0-9]+', '', '7-41.03')
will remove the desired characters but I need it only replace characters in numbers when found at the end of the string and only if preceded by a two character state wrapped in spaces like OK
. It seems if I add anything to the regular expression as far as a lookbehind expression then I can't seem to get the characters replaced. I've come up with the following but I think there must be a simpler expression to accomplish this. Example:
ANSWER
Answered 2021-Jun-15 at 20:02You need to make use of re.sub
callbacks:
QUESTION
Is it possible to use Google ML-Kit On-Device Text Recognition in Flutter? All of the tutorials and resources I am finding online are all firebase_ml_vision
, but I am looking for one that uses the no-cost OCR from Google ML-Kit. How would I do this in Flutter?
EDIT: SOLVED - when I posted this the package was not there, but now it is.
...ANSWER
Answered 2021-Jun-01 at 21:28Yes surely you can use this package [https://pub.dev/packages/mlkit][1] this is google's mlkit. OCR has also support for both ios and android. Happy Coding ;)
QUESTION
As I'm working on a script to correct formatting errors from documents produced by OCR, I ran into an issue where, depending on which loop I run first, my program runs about 80% slower.
Here is a simplified version of my code. I have the following loop to check for uppercase errors (e.g., "posSible"):
...ANSWER
Answered 2021-Jun-13 at 23:19headingsFix
strips out all the line endings, which you presumably did not intend. However, your question is about why changing the order of transformations results in slower execution, so I'll not discuss fixing that here.
fixUppercase
is extremely inefficient at handling lines with many words. It repeatedly calls line.split()
over and over again on the entire book-length string. That isn't terribly slow if each line has maybe a dozen words, but it gets extremely slow if you have one enormous line with tens of thousands of words. I found your program runs vastly faster with this change to only split each line once. (I note that I can't say whether your program is correct as it stands, just that this change should have the same behaviour while being a lot faster. I'm afraid I don't particularly understand why it's comparing each word to see if it's the same as the last word on the line.)
QUESTION
so first of this is my first time asking a question here so forgive me if I make any mistakes.
My Problem is as follows: I'm using python to sort through a bunch of images. The images are sorted by many criteria, one of which is the text inside the Image. I've got OCR working and have a list of "bad" words which arent supposed to be in the Image. The problem is that the OCR often confuses some letters, for example e and a. The question is if there is an easy way to generate similar looking words.
Like create_similar("test")
And output would be ["test", "tast" "lest"]
and so on.
So I could use that as the list of Bad words and avoid false negatives. If I'm just missing a really obvious solution, please tell me. I've been trying for hours now and just can't get it to work.
ANSWER
Answered 2021-Jun-13 at 22:18I really recommend this article by Peter Norvig on how to build a spelling corrector. In it, you will find the following function that returns a set of all the edited strings (whether words or not) that can be made with one simple edit. A simple edit to a word is a deletion (remove one letter), a transposition (swap two adjacent letters), a replacement (change one letter to another) or an insertion (add a letter).
QUESTION
I am working on a python program that reads license plates from trucks. An image that gets processed by this program and filters the characters as output. Here is the input of the image in the program:
...ANSWER
Answered 2021-Jun-13 at 13:31You can capture a single frame by using the VideoCapture method of OpenCV.
QUESTION
I am trying to extract Hindi text from a PDF. I tried all the methods to exract from the PDF, but none of them worked. There are explanations why it doesn't work, but no answers as such. So, I decided to convert the PDF to an image, and then use pytesseract
to extract texts. I have downloaded the Hindi trained data, however that also gives highly inaccurate text.
That's the actual Hindi text from the PDF (download link):
That's my code so far:
...ANSWER
Answered 2021-Jun-08 at 14:46It seems the module pdfplumber
does the work:
QUESTION
I have some sketched images where the images contain text captions. I am trying to remove those caption.
I am using this code:
...ANSWER
Answered 2021-Jun-09 at 20:15The cv2
pre-processing is unecessary here, tesseract is able to find the text on its own. See the example below, commented inline:
QUESTION
I have the following code, which OCR's all PDF files in a specific folder (d:\extracttmp2), but it does not rename the files as I would like, or put the new files in the right place.
Currently, all files are within subfolders of 'extracttmp2'.
The OCR runs correctly, but I would like the OCR'ed files to be renamed to: -_ocred.pdf
. Naming them in such a manner will produce no file overwrites.
Currently, the code OCR's the files, but it saves the new files to the folder above the folder they are located in. It also saves the filenames as "JAN_ocred.pdf", for example, for a file named "JAN.pdf". The result of saving up one folder leads to some file overwrites, which is unwanted.
Also, it doesn't matter if the OCR'ed files remain in the folder where the un-OCR'ed files are located, or if they're saved up one folder. The desired renaming will eliminate any overwrites.
The software I'm using is PDF24. https://creator.pdf24.org/manual/10/#command-line. However, I think that my problem is not with the OCR software, but my syntax in the batch script.
Can anyone tell me what I am doing wrong?
...ANSWER
Answered 2021-Jun-09 at 21:32Is this what you mean? i.e. files will be saved in the same location as before, but each name will be prefixed with their parent directories' name, followed by a hyphen/dash.
QUESTION
Link to original image https://ibb.co/0VC6vkX
I am currently working with an OCR Project. I pre-processed the image, and then applied pre-trained EAST model for text detection.
...ANSWER
Answered 2021-Jun-07 at 07:02Here's a possible solution that you can try improving on by trying a few things:
- by varying Gaussian parameters
- by thresholding the blurred image to see if it improves the result
Code:
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install ocr
Support
Reuse Trending Solutions
Find, review, and download reusable Libraries, Code Snippets, Cloud APIs from over 650 million Knowledge Items
Find more librariesStay Updated
Subscribe to our newsletter for trending solutions and developer bootcamps
Share this Page