hocr | facilitate post-OCR data processing | Computer Vision library

by dmi3kno | R | Version: Current | License: Non-SPDX

kandi X-RAY | hocr Summary

hocr is an R library typically used in Artificial Intelligence and Computer Vision applications. hocr has no reported bugs or vulnerabilities, and it has low support. However, hocr has a Non-SPDX license. You can download it from GitHub.

The goal of hocr is to facilitate post-OCR data processing and wrangling. The package exposes an hOCR parser, hocr_parse, which converts XHTML output into a tidy tibble with one word per row. In addition to the columns exported by tesseract::ocr_data, hocr outputs metadata on how words are organized into lines, paragraphs, content areas and pages. Read more about the hOCR specification here. One of the key elements of the hOCR format is the "bounding box" (bbox): a rectangular region of the image covering the extent of a word recognized by tesseract. The bbox can be used to extract the corresponding part of the image, for example with the magick package and the bbox_to_geometry helper function. hocr also includes tidiers for common hOCR-capable systems. As of version 0.0.9000 only the tesseract output format is supported; support for OCRopus is planned.
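
To make the "one word per row" idea concrete, here is a minimal Python sketch (not the R package's implementation). It assumes a tiny hand-written hOCR fragment without the XHTML namespace, and it assumes the geometry string follows ImageMagick's widthxheight+x+y convention. The Python function names, including the borrowed name bbox_to_geometry, and the sample values are illustrative only.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment; real Tesseract output wraps this in a full
# XHTML page with a namespace declaration.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 640 480'>
  <span class='ocr_line' title='bbox 36 92 310 120'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 90'>Hello</span>
    <span class='ocrx_word' title='bbox 109 92 207 120; x_wconf 87'>world</span>
  </span>
</div>"""

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def bbox_to_geometry(bbox):
    """Turn (x0, y0, x1, y1) into an ImageMagick-style 'WxH+X+Y' string."""
    x0, y0, x1, y1 = bbox
    return f"{x1 - x0}x{y1 - y0}+{x0}+{y0}"


def parse_words(hocr_text):
    """Return one dict per recognized word, tagged with its line number."""
    root = ET.fromstring(hocr_text)
    lines = [el for el in root.iter("span") if el.get("class") == "ocr_line"]
    rows = []
    for line_no, line in enumerate(lines, start=1):
        for word in line.iter("span"):
            if word.get("class") != "ocrx_word":
                continue
            match = BBOX_RE.search(word.get("title", ""))
            if match is None:
                continue
            bbox = tuple(int(value) for value in match.groups())
            rows.append({
                "line": line_no,
                "word": word.text,
                "bbox": bbox,
                "geometry": bbox_to_geometry(bbox),
            })
    return rows


rows = parse_words(SAMPLE)
for row in rows:
    print(row)
```

Each row keeps the line index next to the word, which is the kind of layout metadata the description above refers to.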
            Support

              hocr has a low active ecosystem.
              It has 32 star(s) with 2 fork(s). There are 4 watchers for this library.
              It had no major release in the last 6 months.
              There are 3 open issues and 2 have been closed. On average issues are closed in 134 days. There are no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of hocr is current.

            Quality

              hocr has no bugs reported.

            Security

              hocr has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            License

              hocr has a Non-SPDX License.
              A Non-SPDX license may be an open source license that is not SPDX-compliant, or a non-open-source license; review it closely before use.

            Reuse

              hocr releases are not available. You will need to build from source code and install.
              Installation instructions, examples and code snippets are available.

            Community Discussions

            QUESTION

            Tesseract : Line detection too sensitive
            Asked 2021-May-26 at 21:19

            I am trying to detect the text in .pdf files. They are first converted to images, then given to Tesseract. The detection is good, but it produces too many line breaks. For example, if the page is tilted slightly to the right, the sentence:
            "I like Tesseract for reading text"
            becomes:
            "text read for Tesseract like I"
            And that is already after post-processing, because the raw text is:
            "text
            read
            for
            Tesseract
            like
            I"
            The bug has occurred since the source .pdf files have been in 300 DPI. I understand that the problem comes from the resolution, but I cannot find how to solve it. Here is my Tesseract command:
            Tesseract.exe dummy.pdf dumy-ocr.pdf --psm 12 --dpi 300 -l bvr+fra+eng+deu hocr pdf
            First, I would like to solve the problem of too many lines. Then I would like to find out how to make the image perfectly straight.
            Thank you in advance for your help.

            https://i.stack.imgur.com/crmdO.jpg

            ...

            ANSWER

            Answered 2021-May-26 at 21:19

            You seem to be working backwards. The "many" lines and thus word reversal are due to the anti-clockwise rotation.

            Source https://stackoverflow.com/questions/67598664

            QUESTION

            No output for OCRmyPDF
            Asked 2021-Jan-05 at 08:18

            I'm using OCRmyPDF to extract text from scanned PDF files. I use code from this Colab notebook for that purpose. The only difference is that instead of downloading the PDF file from an online URL, I use a PDF file stored on my local machine (I replaced {invoice_pdf} with {file_name}). Everything looks fine up to the point I run:

            ...

            ANSWER

            Answered 2021-Jan-05 at 08:18

            If the file name contains spaces, then you need to enclose the name in quotation marks.
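
As a small illustration of the quoting advice, the sketch below uses Python's standard shlex.quote to build a POSIX shell command line in which paths with spaces survive intact. The helper name build_ocrmypdf_command and the file names are hypothetical; only the basic `ocrmypdf input output` form is assumed.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf):
    """Build a shell command line with both paths safely quoted."""
    return f"ocrmypdf {shlex.quote(input_pdf)} {shlex.quote(output_pdf)}"


# Hypothetical file names that contain spaces.
command = build_ocrmypdf_command("my invoice.pdf", "my invoice ocr.pdf")
print(command)

# Splitting the command the way a POSIX shell would yields exactly
# three arguments, so each path is passed as a single token.
print(shlex.split(command))
```

Without the quotes, the shell would split "my invoice.pdf" into two separate arguments.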

            Source https://stackoverflow.com/questions/65575093

            QUESTION

            Apache Tika Server - Request Header Parameters?
            Asked 2020-May-26 at 03:47

            The Apache Tika Server provides a Rest API to extract text from a document. It is also possible to set specific request header parameters like X-Tika-PDFOcrStrategy. e.g:

            ...

            ANSWER

            Answered 2020-May-26 at 03:47

            The code that handles the X-Tika-OCR and X-Tika-PDF headers is TikaResource.processHeaderConfig.

            Those header suffixes and values are then mapped onto the TesseractOCRConfig and PDFParserConfig configuration objects via reflection.

            So, to see what X-Tika headers you can set, look up the options on the config class you want to tweak things on (Tesseract or PDF), then build the name, then set the header. If you are not sure what the option does, or what values it takes, look at the JavaDocs for the underlying setter method that will get called.

            For example, setExtractInlineImages on the PDF config maps to X-Tika-PDFextractInlineImages.
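
The naming rule described in this answer can be sketched as a small helper. This is an illustration, not Tika code: the function name tika_header_name is hypothetical, and lowercasing the first letter of the option simply follows the one example given in the answer. The setter name setOutputType is inferred from the X-Tika-OCRoutputType header that appears elsewhere on this page.

```python
def tika_header_name(prefix, setter_name):
    """Derive an X-Tika-* header name from a config setter name.

    prefix is "OCR" or "PDF"; setter_name is e.g. "setExtractInlineImages".
    """
    if not setter_name.startswith("set") or len(setter_name) <= 3:
        raise ValueError(f"not a setter name: {setter_name!r}")
    option = setter_name[3:]
    # Assumption: the option's first letter is lowercased, as in the
    # X-Tika-PDFextractInlineImages example from the answer.
    return f"X-Tika-{prefix}{option[0].lower()}{option[1:]}"


pdf_header = tika_header_name("PDF", "setExtractInlineImages")
ocr_header = tika_header_name("OCR", "setOutputType")
print(pdf_header)
print(ocr_header)
```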

            Source https://stackoverflow.com/questions/62011038

            QUESTION

            getting hocr output from tika-server
            Asked 2020-Feb-06 at 07:08

            I am running OCR on a PDF file using Apache Tika Server.

            I am interested in the hOCR output, but have only succeeded in getting plain text output.

            Following the wiki and the code, I am trying to configure Tesseract using X-Tika-OCR... HTTP headers. In this case, I am using the X-Tika-OCRoutputType: hocr HTTP header, but I get either plain text output or HTML output without hOCR tags.

            I tried both the /tika and /rmeta endpoints.

            The curl commands I use:

            ...

            ANSWER

            Answered 2020-Feb-06 at 07:08

            By inspecting the integration test code of TikaResourceTest, I realized an HTTP header was missing. The correct command should include the X-Tika-PDFOcrStrategy: ocr_only HTTP header. See more in the ocr & pdf parser docs

            The command would thus be:
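
The sketch below is not the original curl command; it is a rough Python equivalent built only from the two headers named above. The server address http://localhost:9998/tika, the Content-Type value, and the helper name build_hocr_request are assumptions for illustration.

```python
import urllib.request

# Assumption: a tika-server instance listening on its usual local port.
TIKA_URL = "http://localhost:9998/tika"


def build_hocr_request(pdf_bytes, url=TIKA_URL):
    """Prepare (but do not send) a PUT request asking Tika for hOCR output."""
    headers = {
        "Content-Type": "application/pdf",
        "X-Tika-OCRoutputType": "hocr",       # header from the question
        "X-Tika-PDFOcrStrategy": "ocr_only",  # header the answer found missing
    }
    return urllib.request.Request(url, data=pdf_bytes, headers=headers, method="PUT")


# Placeholder bytes; a real call would read the PDF from disk.
request = build_hocr_request(b"%PDF-1.4 placeholder")
print(request.get_method(), request.full_url)

# Sending requires a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     hocr = response.read().decode("utf-8")
```

urllib normalizes the capitalization of header names when it stores them; HTTP header names are case-insensitive, so this should not matter to the server.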

            Source https://stackoverflow.com/questions/59662119

            QUESTION

            How do I change the contrast of a picture using Wand?
            Asked 2020-Jan-08 at 13:44

            I have the picture below used in Tesseract OCR:

            My code to process the picture is:

            ...

            ANSWER

            Answered 2020-Jan-07 at 18:11

            I would use cv2 and/or numpy.array to convert light gray colors to white.
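
The idea can be shown without any image library. The sketch below works on a grayscale image stored as nested Python lists, which is a deliberate simplification of the cv2/numpy approach the answer suggests; the threshold of 200 and the function name whiten_light_pixels are hypothetical.

```python
def whiten_light_pixels(gray_rows, threshold=200):
    """Return a copy in which every pixel brighter than threshold becomes 255."""
    return [
        [255 if value > threshold else value for value in row]
        for row in gray_rows
    ]


# Tiny hypothetical grayscale image (0 = black, 255 = white).
image = [
    [12, 190, 230],
    [205, 90, 255],
]
cleaned = whiten_light_pixels(image)
print(cleaned)

# With a numpy array the same operation is a single masked assignment:
#     img[img > threshold] = 255
```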

            Source https://stackoverflow.com/questions/59632931

            QUESTION

            How to get the coordinates of the text recognized from an image using OCR in Python
            Asked 2019-Dec-09 at 08:08

            I am trying to get the coordinates or positions of text characters in an image using Tesseract. I want to know the exact pixel position, so that I can click that text using some other tool.

            Edit :

            ...

            ANSWER

            Answered 2018-Feb-22 at 17:08

            You have the coordinates of the bounding box in every line.

            From: Training Tesseract – Make Box Files

            character, left, bottom, right, top, page

            So for each character you get the character, followed by its bounding box coordinates, followed by the 0-based page number.
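
A minimal sketch of reading one such line and turning it into a clickable pixel position. Tesseract box files measure coordinates from the bottom-left corner of the image, so the image height is needed to convert to the top-left origin most GUI tools use. The names CharBox, parse_box_line and click_point, and the sample values, are hypothetical.

```python
from typing import NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse 'character left bottom right top page' into a CharBox."""
    char, left, bottom, right, top, page = line.split()
    return CharBox(char, int(left), int(bottom), int(right), int(top), int(page))


def click_point(box, image_height):
    """Centre of the box in top-left-origin pixel coordinates."""
    x = (box.left + box.right) // 2
    # Box files count y upwards from the bottom edge, so flip the axis.
    y = image_height - (box.bottom + box.top) // 2
    return x, y


box = parse_box_line("T 10 20 30 60 0")  # hypothetical box-file line
point = click_point(box, image_height=100)
print(box)
print(point)
```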

            Source https://stackoverflow.com/questions/48928592

            QUESTION

            AttributeError: module 'pytesseract' has no attribute 'run_tesseract'
            Asked 2019-Dec-04 at 06:45

            I am trying to use the run_tesseract function to get hOCR output for extracting text from bank receipt images. However, I am getting the above error message. I have installed Tesseract-OCR on my laptop and added its path to my system Path variable. I have a Windows 10 64-bit operating system.

            I have tried uninstalling and reinstalling it also but to no avail.

            ...

            ANSWER

            Answered 2019-Dec-04 at 01:40

            Replace pytesseract.run_tesseract() with pytesseract.pytesseract.run_tesseract().

            Credit Nithin in the comments. Adding this as an answer to close it out.

            Source https://stackoverflow.com/questions/56286006

            QUESTION

            Delete OCR word from Image (OpenCV,Python)
            Asked 2019-Sep-18 at 23:53

            So, from what I can begin..

            I am working with OCR. The script works pretty well for what I need. It detects the words with an accuracy which for me is ok.

            This is the result: 100% accuracy with attached image.

            ...

            ANSWER

            Answered 2019-Sep-18 at 23:53

            Here's a simple approach

            • Convert image to grayscale
            • Otsu's threshold
            • Dilate to connect contours
            • Find contours and extract ROI for each word
            • Perform OCR and remove word

            After converting to grayscale, we apply Otsu's threshold to obtain a binary image.

            Next we invert the image and dilate to form a single contour for each word.

            From here we find contours and extract the ROI for each word. Here are the detected ROIs.

            We throw each ROI into Pytesseract OCR. If the OCR result is a word we want to remove, we simply "delete" the word by filling in the ROI with white and replace it in the original image.

            With

            Source https://stackoverflow.com/questions/47226647

            QUESTION

            pytesseract temporary output files "No such file or directory" error
            Asked 2019-Mar-25 at 15:04

            I am using pytesseract with the line:

            ...

            ANSWER

            Answered 2017-Aug-07 at 03:05

            It turns out that the reason pytesseract was not able to find the temporary output files was that they were being stored with extensions other than .txt or .box (they were .hocr files). From the source code, these are the only types of tesseract output files supported by pytesseract (or more like 'looked for' by pytesseract). The relevant snippets from the source are below:

            input_file_name = '%s.bmp' % tempnam()
            output_file_name_base = tempnam()
            if not boxes:
                output_file_name = '%s.txt' % output_file_name_base
            else:
                output_file_name = '%s.box' % output_file_name_base

            if status:
                errors = get_errors(error_string)
                raise TesseractError(status, errors)
            f = open(output_file_name, 'rb')

            Looking at pytesseract's GitHub pull requests, it seems that support for other output types is planned but not yet implemented (the source code I used to show why .hocr files appeared not to be found was copied from the pytesseract master branch).

            Until then, I made some hackish changes to the pytesseract script to support multiple file types.

            This version does not set an extension for the output file (since tesseract does that automatically). Instead, it searches the directory where pytesseract stores its temporary output files for the file whose name starts with the output file name assigned by pytesseract (up to the first '.' character), regardless of extension:
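
The answer's modified script is not reproduced on this page. As a hedged sketch of the lookup it describes, the helper below (hypothetical name find_output_file) scans a directory for the first file whose name, up to the first '.', equals the base name. It assumes the base name itself contains no '.' character.

```python
import os
import tempfile


def find_output_file(base_path):
    """Return the path of the file matching base_path, whatever its extension."""
    directory, base_name = os.path.split(base_path)
    for name in sorted(os.listdir(directory or ".")):
        # Compare only the part before the first '.', ignoring the extension.
        if name.split(".", 1)[0] == base_name:
            return os.path.join(directory, name)
    raise FileNotFoundError(f"no output file found for {base_path!r}")


# Demonstration with a throwaway directory and a fake .hocr output file.
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "tess_out")
    with open(base + ".hocr", "w", encoding="utf-8") as handle:
        handle.write("<html></html>")
    found = os.path.basename(find_output_file(base))

print(found)
```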

            Source https://stackoverflow.com/questions/45538673

            QUESTION

            Extract data from tesseract hocr xhtml file
            Asked 2018-Jun-05 at 15:35

            I'm trying to use Python to extract data from Tesseract's hOCR output file. We're limited to tesseract version 3.04, so no image_to_data function or TSV output is available. I have been able to do it with BeautifulSoup and in R, but neither is available in the environment where it needs to be deployed. I am just trying to extract the word and the confidence, "x_wconf". An example output file is below, for which I'd be happy to just return lists of [90, 87, 89, 89] and ['the', '(quick)', '[brown]', '{fox}', 'jumps!'].

            lxml is the only XML parser available in the environment besides ElementTree, so I'm a bit at a loss for how to proceed.

            ...

            ANSWER

            Answered 2018-Jun-05 at 15:35

            Figured out a (gross) way to do it using xpath.
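
The answer's XPath code is not shown here. One possible approach, sketched with the standard library's ElementTree (which the question says is available) and its limited XPath support, is below. The sample markup is a hypothetical fragment modeled on hOCR word spans, and the function name words_and_confidences is illustrative.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment; words may be wrapped in inline tags such as
# <strong>, so the text is collected with itertext().
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<span class="ocr_line" title="bbox 10 10 300 40">
<span class="ocrx_word" title="bbox 10 10 50 40; x_wconf 90">the</span>
<span class="ocrx_word" title="bbox 60 10 130 40; x_wconf 87">(quick)</span>
<span class="ocrx_word" title="bbox 140 10 220 40; x_wconf 89"><strong>[brown]</strong></span>
</span></body></html>"""

NS = {"x": "http://www.w3.org/1999/xhtml"}
WCONF_RE = re.compile(r"x_wconf (\d+)")


def words_and_confidences(hocr_text):
    """Return parallel lists of words and their x_wconf confidences."""
    root = ET.fromstring(hocr_text)
    words, confidences = [], []
    for span in root.iterfind(".//x:span[@class='ocrx_word']", NS):
        match = WCONF_RE.search(span.get("title", ""))
        if match is None:
            continue  # skip word spans that carry no confidence value
        words.append("".join(span.itertext()).strip())
        confidences.append(int(match.group(1)))
    return words, confidences


words, confidences = words_and_confidences(SAMPLE)
print(words)
print(confidences)
```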

            Source https://stackoverflow.com/questions/50702264

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install hocr

            You can install the development version from GitHub (https://github.com/dmi3kno/hocr).

            Support

            For new features, suggestions and bugs, create an issue on GitHub. If you have questions, search and ask on the Stack Overflow community page.
            CLONE
          • HTTPS

            https://github.com/dmi3kno/hocr.git

          • CLI

            gh repo clone dmi3kno/hocr

          • sshUrl

            git@github.com:dmi3kno/hocr.git
