hocr | facilitate post-OCR data processing | Computer Vision library

by dmi3kno R Version: Current License: Non-SPDX


kandi X-RAY | hocr Summary

hocr is an R library typically used in Artificial Intelligence and Computer Vision applications. hocr has no reported bugs or vulnerabilities, and it has low support. However, hocr has a Non-SPDX License. You can download it from GitHub.

The goal of hocr is to facilitate post-OCR data processing and wrangling. The package exposes an hOCR parser, hocr_parse, which converts XHTML-format output into a tidy tibble with one word per row. In addition to the columns exported by tesseract::ocr_data, hocr outputs metadata about the organization of words into lines, paragraphs, content areas and pages. Read more in the hOCR specification. One of the key elements of the hOCR format is the “bounding box” (bbox), a rectangular region of the image covering the extent of the word recognized by tesseract. This bbox can be used to extract the respective part of the image, for example with the magick package and the bbox_to_geometry helper function. hocr also includes tidiers for common hOCR-capable systems. As of version 0.0.9000, only the tesseract output format is supported; support for OCRopus is planned.
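To make the bbox-to-geometry idea concrete, here is a minimal Python sketch of the kind of conversion such a helper performs. It is not the package's R implementation. It assumes the bbox is the hOCR string "x0 y0 x1 y1" (top-left and bottom-right corners) and that the target is an ImageMagick-style geometry string WIDTHxHEIGHT+X+Y; the function name simply mirrors the helper mentioned above.

```python
def bbox_to_geometry(bbox):
    """Convert an hOCR bbox string 'x0 y0 x1 y1' to 'WIDTHxHEIGHT+X+Y'.

    Illustrative only: assumes integer pixel coordinates with the origin
    at the top-left corner of the image, as in the hOCR specification.
    """
    x0, y0, x1, y1 = (int(value) for value in bbox.split())
    if x1 < x0 or y1 < y0:
        raise ValueError("bbox corners are out of order: %r" % bbox)
    # Geometry is expressed as size plus the offset of the top-left corner.
    return "{}x{}+{}+{}".format(x1 - x0, y1 - y0, x0, y0)


# A made-up word box, 60 px wide and 24 px tall.
print(bbox_to_geometry("36 92 96 116"))
```

The resulting string can then be handed to an image-cropping routine that accepts this geometry format.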

Support

hocr has a low active ecosystem.

It has 32 star(s) with 2 fork(s). There are 4 watchers for this library.

It had no major release in the last 6 months.

There are 3 open issues and 2 have been closed. On average issues are closed in 134 days. There are no pull requests.

It has a neutral sentiment in the developer community.

The latest version of hocr is current.

Quality

hocr has no bugs reported.

Security

hocr has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

License

hocr has a Non-SPDX License.

A Non-SPDX license may be an open-source license that is not SPDX-compliant, or a license that is not open source at all, so review it closely before use.

Reuse

hocr releases are not available. You will need to build from source code and install.

Installation instructions, examples and code snippets are available.


hocr Key Features

No Key Features are available at this moment for hocr.

hocr Examples and Code Snippets

No Code Snippets are available at this moment for hocr.

Community Discussions

Trending Discussions on hocr

Tesseract : Line detection too sensitive

No output for OCRmyPDF

Apache Tika Server - Request Header Parameters?

getting hocr output from tika-server

How do I change the contrast of a picture using Wand?

How to get the co-ordinates of the text recogonized from Image using OCR in python

AttributeError: module 'pytesseract' has no attribute 'run_tesseract'

Delete OCR word from Image (OpenCV,Python)

pytesseract temporary output files "No such file or directory" error

Extract data from tesseract hocr xhtml file

QUESTION

Tesseract : Line detection too sensitive

Asked 2021-May-26 at 21:19

I am trying to detect the text in .pdf files. They are first converted to images, then given to Tesseract. The detection is good, but it produces too many line breaks. For example, if the file is a bit tilted to the right, the sentence:
"I like Tesseract for reading text"
becomes:
"text read for Tesseract like I"
And that's already after a treatment because the raw text is :
"text
read
for
Tesseract
like
I"
The bug has occurred since the source .pdf files are at 300 DPI. I understand that the problem comes from the resolution, but I cannot find how to solve it. Here is my Tesseract command:
Tesseract.exe dummy.pdf dumy-ocr.pdf --psm 12 --dpi 300 -l bvr+fra+eng+deu hocr pdf
First, I would like to solve the problem of too many lines. Then I will work out how to make the image perfectly straight.
Thank you in advance for your help.

https://i.stack.imgur.com/crmdO.jpg

...

ANSWER

Answered 2021-May-26 at 21:19

You seem to be working backwards. The "many" lines and thus word reversal are due to the anti-clockwise rotation.

Source https://stackoverflow.com/questions/67598664

QUESTION

No output for OCRmyPDF

Asked 2021-Jan-05 at 08:18

I'm using OCRmyPDF to extract text from scanned pdf files. I use the code from this Colab notebook for that purpose. The only difference is that instead of downloading the pdf file from an online url, I use a pdf file stored on my local machine (I replaced {invoice_pdf} with {file_name}). Everything looks fine up to the point I run:

...

ANSWER

Answered 2021-Jan-05 at 08:18

If the file name contains spaces, then you need to enclose the name in quotation marks.
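A minimal Python sketch of that advice, assuming the command is assembled as a string before being handed to a shell. The helper name and the file names are made up for illustration; the ocrmypdf command line is assumed to take the input file followed by the output file.

```python
import shlex


def ocrmypdf_command(input_pdf, output_pdf):
    """Build a shell command string with both paths safely quoted.

    Hypothetical helper: shlex.quote only adds quotes when a name
    contains characters (such as spaces) that the shell would split on.
    """
    return "ocrmypdf {} {}".format(shlex.quote(input_pdf), shlex.quote(output_pdf))


command = ocrmypdf_command("my scanned invoice.pdf", "output.pdf")
print(command)
# Splitting the command the way a POSIX shell would keeps the name in one piece.
print(shlex.split(command))
```

Passing the arguments as a list to subprocess.run avoids the quoting problem altogether, because no shell splitting takes place.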

Source https://stackoverflow.com/questions/65575093

QUESTION

Apache Tika Server - Request Header Parameters?

Asked 2020-May-26 at 03:47

The Apache Tika Server provides a Rest API to extract text from a document. It is also possible to set specific request header parameters like X-Tika-PDFOcrStrategy. e.g:

...

ANSWER

Answered 2020-May-26 at 03:47

The code that handles the X-Tika-OCR and X-Tika-PDF headers is TikaResource.processHeaderConfig.

Those header suffixes and values are then mapped onto the TesseractOCRConfig and PDFParserConfig configuration objects via reflection.

So, to see what X-Tika headers you can set, look up the options on the config class you want to tweak things on (Tesseract or PDF), then build the name, then set the header. If you are not sure what the option does, or what values it takes, look at the JavaDocs for the underlying setter method that will get called.

For example, setExtractInlineImages on the PDF config maps to X-Tika-PDFextractInlineImages.
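The naming rule can be sketched in a few lines of Python. This is only an illustration of the mapping described above, not Tika code: the helper name is hypothetical, and the rule (drop "set", lower-case the first letter, prepend the prefix) is inferred from the single example given.

```python
HEADER_PREFIXES = {
    "ocr": "X-Tika-OCR",  # options on TesseractOCRConfig
    "pdf": "X-Tika-PDF",  # options on PDFParserConfig
}


def header_for_setter(config, setter_name):
    """Derive a request header name from a config setter name.

    Assumption: the header suffix is the setter name without 'set',
    with its first letter lower-cased, as in the example above.
    """
    if not setter_name.startswith("set") or len(setter_name) <= 3:
        raise ValueError("not a setter name: %r" % setter_name)
    option = setter_name[3:]
    return HEADER_PREFIXES[config] + option[0].lower() + option[1:]


print(header_for_setter("pdf", "setExtractInlineImages"))
```

Other questions on this page write X-Tika-PDFOcrStrategy with a capital O, so the server may be lenient about the case of that first letter; treat the exact casing as something to verify against your Tika version.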

Source https://stackoverflow.com/questions/62011038

QUESTION

getting hocr output from tika-server

Asked 2020-Feb-06 at 07:08

I am running OCR on a PDF file using Apache Tika Server.

I am interested in the hOCR output, but I only succeed in getting the output in plain text format.

Following the wiki and the code, I am trying to configure Tesseract using X-Tika-OCR... HTTP headers. In this case, I am using the X-Tika-OCRoutputType: hocr HTTP header, but I get the plain text output or html output without HOCR tags.

I tried both the /tika and /rmeta endpoints.

The curl commands I use:

...

ANSWER

Answered 2020-Feb-06 at 07:08

By inspecting the integration test code of TikaResourceTest, I realized an HTTP header was missing. The correct command should include the X-Tika-PDFOcrStrategy: ocr_only HTTP header. See more in the ocr & pdf parser docs

The command would thus be:
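The original curl command is not reproduced here. As a rough Python equivalent, and only a sketch under assumptions, the request below combines the two headers named in the question and the answer. The host, the port 9998 and the PUT to /tika are assumptions about a default local tika-server setup, and both function names are hypothetical.

```python
import http.client


def hocr_request_headers():
    """Headers discussed above: hOCR output plus OCR-only PDF strategy."""
    return {
        "X-Tika-OCRoutputType": "hocr",
        "X-Tika-PDFOcrStrategy": "ocr_only",
        "Content-Type": "application/pdf",
    }


def fetch_hocr(pdf_path, host="localhost", port=9998):
    """Send a PDF to a running Tika server and return the response body.

    Assumes a server is listening on host:port; http.client sends the
    header names exactly as written in the dictionary.
    """
    with open(pdf_path, "rb") as handle:
        body = handle.read()
    connection = http.client.HTTPConnection(host, port)
    try:
        connection.request("PUT", "/tika", body=body, headers=hocr_request_headers())
        return connection.getresponse().read().decode("utf-8")
    finally:
        connection.close()
```

Calling fetch_hocr("scan.pdf") requires a live server, so check the returned markup for ocrx_word spans to confirm that hOCR was actually produced.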

Source https://stackoverflow.com/questions/59662119

QUESTION

How do I change the contrast of a picture using Wand?

Asked 2020-Jan-08 at 13:44

I have the picture below used in Tesseract OCR:

My code to process the picture is:

...

ANSWER

Answered 2020-Jan-07 at 18:11

I would use cv2 and/or numpy.array to convert light gray colors to white.

Source https://stackoverflow.com/questions/59632931

QUESTION

How to get the co-ordinates of the text recogonized from Image using OCR in python

Asked 2019-Dec-09 at 08:08

I am trying to get the coordinates or positions of text character from an Image using Tesseract. I want to know the exact pixel position, so that i can click that text using some other tool.

Edit :

...

ANSWER

Answered 2018-Feb-22 at 17:08

You have the coordinates of the bounding box in every line.

From: Training Tesseract – Make Box Files

character, left, bottom, right, top, page

So for each character you get the character, followed by its bounding box coordinates, followed by the 0-based page number.
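A minimal sketch of reading one such line in Python, assuming the fields are separated by spaces and that box-file coordinates are measured from the bottom-left corner of the image. The names CharBox, parse_box_line and click_point are made up for illustration, and the sample line is invented.

```python
from typing import NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse 'character left bottom right top page' into a CharBox."""
    char, left, bottom, right, top, page = line.split()
    return CharBox(char, int(left), int(bottom), int(right), int(top), int(page))


def click_point(box, image_height):
    """Centre of the box in top-left-origin pixel coordinates.

    Box files count y upwards from the bottom edge, while most screen
    and image tools count downwards from the top, so y is flipped.
    """
    x = (box.left + box.right) // 2
    y = image_height - (box.bottom + box.top) // 2
    return x, y


box = parse_box_line("s 734 494 751 519 0")
print(box, click_point(box, image_height=1000))
```

The image height has to come from the image itself, since the box file does not store it.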

Source https://stackoverflow.com/questions/48928592

QUESTION

AttributeError: module 'pytesseract' has no attribute 'run_tesseract'

Asked 2019-Dec-04 at 06:45

I am trying to use the run_tesseract function to get hocr output for extracting text from bank receipt images. However, I am getting the above error message. I have installed Tesseract-OCR on my laptop and have also added its path to my system Path variable. I have a 64-bit Windows 10 operating system.

I have tried uninstalling and reinstalling it also but to no avail.

...

ANSWER

Answered 2019-Dec-04 at 01:40

Replace pytesseract.run_tesseract() with pytesseract.pytesseract.run_tesseract().

Credit Nithin in the comments. Adding this as an answer to close it out.

Source https://stackoverflow.com/questions/56286006

QUESTION

Delete OCR word from Image (OpenCV,Python)

Asked 2019-Sep-18 at 23:53

So, from what I can begin..

I am working with OCR. The script works pretty well for what I need. It detects the words with an accuracy which for me is ok.

This is the result: 100% accuracy with attached image.

...

ANSWER

Answered 2019-Sep-18 at 23:53

Here's a simple approach

Convert image to grayscale
Otsu's threshold
Dilate to connect contours
Find contours and extract ROI for each word
Perform OCR and remove word

After converting to grayscale, we apply Otsu's threshold to obtain a binary image.

Next, we invert the image and dilate to form a single contour for each word.

From here we find contours and extract the ROI for each word. Here are the detected ROIs.

We throw each ROI into Pytesseract OCR. If the OCR result is a word we want to remove, we simply "delete" the word by filling in the ROI with white and replacing it in the original image.


Source https://stackoverflow.com/questions/47226647

QUESTION

pytesseract temporary output files "No such file or directory" error

Asked 2019-Mar-25 at 15:04

I am using pytesseract with the line:

...

ANSWER

Answered 2017-Aug-07 at 03:05

It turns out that pytesseract could not find the temporary output files because they were being stored with extensions other than .txt or .box (they were .hocr files). From the source code, these are the only types of tesseract output files supported by pytesseract (or rather, 'looked for' by pytesseract). The relevant snippets from the source are below:

    input_file_name = '%s.bmp' % tempnam()
    output_file_name_base = tempnam()
    if not boxes:
        output_file_name = '%s.txt' % output_file_name_base
    else:
        output_file_name = '%s.box' % output_file_name_base

    if status:
        errors = get_errors(error_string)
        raise TesseractError(status, errors)
    f = open(output_file_name, 'rb')

Looking at pytesseract's GitHub pull requests, it seems that support for other output types is planned but not yet implemented (the source code I used to show why .hocr files appeared not to be found was copied from the pytesseract master branch).

Until then, I made some hackish changes to the pytesseract script to support multiple file types.

This version does not set an extension for the output file (since tesseract does that automatically) and looks through the directory that pytesseract stores its temp output files to and looks for the file that starts with the output file name (up to the first '.' character) assigned by pytesseract (without caring about the extension):
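A minimal sketch of that lookup, not the author's actual patch: it scans the directory of the temporary base name and returns the first file whose name, up to the first '.', matches the base. The function name is hypothetical.

```python
import os


def find_output_file(output_file_name_base):
    """Locate tesseract's output file regardless of its extension.

    Compares the part of each file name before the first '.' with the
    base name, so 'base.hocr', 'base.txt' or 'base.box' all match.
    """
    directory, base = os.path.split(output_file_name_base)
    stem = base.split(".", 1)[0]
    for name in sorted(os.listdir(directory or ".")):
        if name.split(".", 1)[0] == stem:
            return os.path.join(directory, name)
    raise FileNotFoundError("no output file found for %r" % output_file_name_base)
```

If several files share the same stem, this returns the alphabetically first one, so a stricter version might take the expected extension as an extra argument.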

Source https://stackoverflow.com/questions/45538673

QUESTION

Extract data from tesseract hocr xhtml file

Asked 2018-Jun-05 at 15:35

I'm trying to use Python to extract data from Tesseract's hocr output file. We're limited to Tesseract version 3.04, so no image_to_data function or tsv output is available. I have been able to do it with beautifulsoup and in R, but neither is available in the environment in which it needs to be deployed. I am just trying to extract the word and the confidence "x_wconf". An example output file is below, for which I'd be happy to just return lists of [90, 87, 89, 89] and ['the', '(quick)', '[brown]', '{fox}', 'jumps!'].

lxml is the only available xml parser outside of the elementtree in the environment so I'm a bit at a loss for how to proceed.

...

ANSWER

Answered 2018-Jun-05 at 15:35

Figured out a (gross) way to do it using xpath.
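The xpath expression from the answer is not shown here. As an alternative sketch that uses only the standard-library ElementTree mentioned in the question, the function below walks every span with class ocrx_word and reads x_wconf from its title attribute. It assumes the hOCR file is well-formed XHTML, and the sample markup is invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

WCONF_RE = re.compile(r"x_wconf\s+(\d+)")


def words_and_confidences(hocr_text):
    """Return (words, confidences) from hOCR markup."""
    root = ET.fromstring(hocr_text)
    words, confidences = [], []
    for element in root.iter():
        # Tags may carry the XHTML namespace, e.g. '{http://...}span'.
        if not element.tag.endswith("span"):
            continue
        if "ocrx_word" not in element.get("class", "").split():
            continue
        match = WCONF_RE.search(element.get("title", ""))
        if match is None:
            continue
        # itertext() also collects text nested in <strong> or <em>.
        words.append("".join(element.itertext()).strip())
        confidences.append(int(match.group(1)))
    return words, confidences


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<span class="ocr_line" title="bbox 36 92 580 122">
<span class="ocrx_word" title="bbox 36 92 96 116; x_wconf 90">the</span>
<span class="ocrx_word" title="bbox 109 92 191 122; x_wconf 87"><strong>(quick)</strong></span>
</span></body></html>"""

print(words_and_confidences(SAMPLE))
```

Matching on the end of the tag name sidesteps the namespace prefix, which is often what makes xpath queries against hOCR awkward.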

Source https://stackoverflow.com/questions/50702264

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install hocr

You can install the development version from GitHub.

Support

For new features, suggestions and bugs, create an issue on GitHub. If you have questions, search for existing answers or ask on the Stack Overflow community page.
