tesseract-ocr | package contains the Tesseract Open Source OCR Engine | Computer Vision library

 by jimregan | C++ | Version: Current | License: Non-SPDX

kandi X-RAY | tesseract-ocr Summary

tesseract-ocr is a C++ library typically used in Artificial Intelligence and Computer Vision applications. It has no reported bugs or vulnerabilities, and it has low support. However, it has a Non-SPDX License. You can download it from GitHub.

This package contains the Tesseract Open Source OCR Engine. Originally developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado, all the code in this distribution is now licensed under the Apache License.

            kandi-support Support

              tesseract-ocr has a low active ecosystem.
              It has 12 star(s) with 2 fork(s). There are 2 watchers for this library.
              It had no major release in the last 6 months.
              tesseract-ocr has no issues reported. There are no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of tesseract-ocr is current.

            kandi-Quality Quality

              tesseract-ocr has no bugs reported.

            kandi-Security Security

              tesseract-ocr has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            kandi-License License

              tesseract-ocr has a Non-SPDX License.
              A Non-SPDX license may be an open source license that is not SPDX-compliant, or it may not be an open source license at all, so review it closely before use.

            kandi-Reuse Reuse

              tesseract-ocr releases are not available. You will need to build from source code and install.

            tesseract-ocr Key Features

            No Key Features are available at this moment for tesseract-ocr.

            tesseract-ocr Examples and Code Snippets

            No Code Snippets are available at this moment for tesseract-ocr.

            Community Discussions

            QUESTION

            How to improve Hindi text extraction?
            Asked 2021-Jun-11 at 20:13

            I am trying to extract Hindi text from a PDF. I tried all the methods to extract it from the PDF, but none of them worked. There are explanations of why it doesn't work, but no answers as such. So, I decided to convert the PDF to an image and then use pytesseract to extract the text. I have downloaded the Hindi trained data; however, that also gives highly inaccurate text.

            That's the actual Hindi text from the PDF (download link):

            That's my code so far:

            ...

            ANSWER

            Answered 2021-Jun-08 at 14:46

            It seems the module pdfplumber does the work:

            Source https://stackoverflow.com/questions/67816185

            QUESTION

            How to remove text from the sketched image
            Asked 2021-Jun-10 at 04:07

            I have some sketched images that contain text captions. I am trying to remove those captions.

            I am using this code:

            ...

            ANSWER

            Answered 2021-Jun-09 at 20:15

            The cv2 pre-processing is unnecessary here; tesseract is able to find the text on its own. See the example below, commented inline:

            Source https://stackoverflow.com/questions/67910691
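
As a hedged illustration of the answer above (this is not the answerer's code, which is elided here): pytesseract's `image_to_data` with `output_type=pytesseract.Output.DICT` returns parallel lists describing each detected element. The helper `text_boxes` below is a hypothetical name, and the sample dictionary is made up to mimic that layout; the sketch only shows how the word boxes could be selected before painting over them.

```python
def text_boxes(data, min_conf=0.0):
    """Return (left, top, width, height) for every recognized word.

    `data` is assumed to have the layout produced by
    pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT).
    """
    boxes = []
    for i, word in enumerate(data["text"]):
        # Non-word rows (page, block, line) carry empty text and conf == -1.
        if word.strip() and float(data["conf"][i]) >= min_conf:
            boxes.append((data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes


# Made-up sample in the image_to_data dictionary layout.
sample = {
    "text": ["", "Hello", " ", "world"],
    "conf": ["-1", "96", "-1", "12"],
    "left": [0, 10, 0, 80],
    "top": [0, 20, 0, 20],
    "width": [200, 60, 0, 70],
    "height": [100, 18, 0, 18],
}
print(text_boxes(sample, min_conf=50))  # [(10, 20, 60, 18)]
```

Each returned box could then be filled with the background color, for instance with `cv2.rectangle(img, (x, y), (x + w, y + h), (255, 255, 255), -1)`, assuming a white background.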

            QUESTION

            Could not find a package configuration file provided by "Leptonica"
            Asked 2021-Jun-07 at 18:55

            I am trying to generate a visual studio 2019 C++ project from the tesseract 4.1.1 source code. Ultimately, I want to include a tesseract C++ project in my custom solution that consumes OCR results.

            When I follow these steps:

            1. Download and extract tesseract code https://github.com/tesseract-ocr/tesseract/archive/refs/tags/4.1.1.zip to "C:\tesseract" directory.
            2. Execute the following commands in a Developer Command Prompt for VS 2019:

            C:\Windows\System32>cd "C:\tesseract"
            C:\tesseract>mkdir build
            C:\tesseract>cd build
            C:\tesseract\build>cmake ..

            I receive this error:

            ...

            ANSWER

            Answered 2021-Jun-05 at 07:13

            There are several tutorials on how to build tesseract on Windows with cmake and VS, e.g. https://bucket401.blogspot.com/2021/03/building-tesserocr-on-ms-windows-64bit.html (you can ignore the end of the tutorial, which covers the Python module), minimalist tesseract, or with clang.

            Source https://stackoverflow.com/questions/67839925

            QUESTION

            Error when running python script in node js with python-shell npm
            Asked 2021-May-16 at 17:24

            I am developing a web application that has image processing functions. I used opencv-python and integrated the Python script into Node.js using the python-shell package.

            index.js;

            ...

            ANSWER

            Answered 2021-May-16 at 17:24

            I solved the error by passing the full path of the image to imread() in the Python script.

            Source https://stackoverflow.com/questions/67147130
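
A minimal sketch of the fix described in the answer above, assuming the image sits in a known directory. `cv2.imread` returns `None` instead of raising when it cannot read a file, so a relative path that resolves against the wrong working directory fails silently. The helper name `resolve_image_path` and the file name `input.png` are hypothetical.

```python
from pathlib import Path


def resolve_image_path(filename, base_dir):
    """Return an absolute path to `filename` inside `base_dir`."""
    return (Path(base_dir) / filename).resolve()


# In a real script, base_dir would typically be Path(__file__).parent, so the
# path no longer depends on the directory the Node.js process was started from.
image_path = resolve_image_path("input.png", base_dir=".")
print(image_path.is_absolute())  # True
```

The resulting path can be passed on as `cv2.imread(str(image_path))`.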

            QUESTION

            How to OCR a text with white colour characters on a blue background from a cropped image?
            Asked 2021-May-06 at 10:37

            First, I want to crop an image using a mouse event, and then print the text inside the cropped image. I tried several OCR scripts, but none of them work for the image attached below. I think the reason is that the text has white characters on a blue background.

            Can you help me with doing this?

            Full image:

            Cropped image:

            An example what I tried is:

            ...

            ANSWER

            Answered 2021-May-06 at 10:37

            [EDIT]

            For anyone wondering, the image in the question was updated after posting my answer. That was the original image:

            Thus, the below output in my original answer.

            That's the newly posted image:

            The specific Turkish characters, especially in the last word, are still not properly detected (since I still can't use lang='tur' right now), but at least the Ö and Ü can be detected using lang='deu', which I have installed:

            Source https://stackoverflow.com/questions/67410136

            QUESTION

            Detecting white text on a bright background with tesseract
            Asked 2021-May-05 at 01:11

            I'm having issues reading white text on a bright background: it finds the text itself, but it cannot really translate it correctly.

            The image:

            The result I keep getting is LanEerus which is not that far off, to be honest.

            What I'm wondering is what image pre-processing could fix this? I'm using photoshop to manually pre-process it before I try to do it with code, to find what should work first.

            I've tried making it a bitmap, but that makes the borders of the text pretty bad, resulting in tesseract just translating it to random characters.

            Inverting colors and/or grayscaling doesn't seem to do the trick, either.

            Anyone have any ideas? I know it's a pretty bad background for the text for this case. Trust me, I wish that the background was different!

            My code for the tests:

            ...

            ANSWER

            Answered 2021-May-05 at 01:11

            Here's one possible solution. This is in Python, but it should be clear enough for a Java port. We will apply a method called gained division. The idea is that you try to build a model of the background and then weight each input pixel by that model. The output gain should be relatively constant across most of the image. This will get rid of most of the background color variation. We can use a morphological chain to clean the result a little bit. Let's see the code:

            Source https://stackoverflow.com/questions/67386714
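
The answer's code is elided above, so the following is only a sketch of the idea it describes: estimate the background, then divide each pixel by that estimate. It assumes a local-maximum filter as the background model and uses plain NumPy rather than OpenCV; the function name `gain_division`, the odd `kernel` size, and the tiny test image are all illustrative assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def gain_division(gray, kernel=15):
    """Divide each pixel by a local-maximum estimate of the background.

    `gray` is a 2-D uint8 array; `kernel` must be odd so the output keeps
    the input shape.
    """
    pad = kernel // 2
    padded = np.pad(gray, pad, mode="edge")
    # Background model (assumption): brightest pixel in each kernel x kernel window.
    background = sliding_window_view(padded, (kernel, kernel)).max(axis=(2, 3))
    background = np.maximum(background, 1).astype(np.float64)  # avoid division by zero
    gain = gray.astype(np.float64) / background * 255.0
    return np.clip(gain, 0, 255).astype(np.uint8)


# Tiny made-up image: uniform background of 200 with one darker pixel.
img = np.full((5, 5), 200, dtype=np.uint8)
img[2, 2] = 50
out = gain_division(img, kernel=3)
print(out[0, 0], out[2, 2])
```

After the division, the uniform background maps to the top of the range while the darker pixel stays dark, which is the flattening effect the answer is after. A threshold and morphological clean-up would follow in a full pipeline.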

            QUESTION

            Open CV OCR improve data extraction from color image with background
            Asked 2021-Apr-28 at 10:22

            I am trying to extract some info from mobile screenshots. My code is able to retrieve some of the info, but not all of it. I read the image, converted it to grey, removed the parts that are not required, and applied a Gaussian threshold, but not all of the text is being read.

            ...

            ANSWER

            Answered 2021-Apr-28 at 10:22

            Have a look at the page segmentation modes of pytesseract, cf. this Q&A. For example, using config='-psm 12' will already give you all desired texts. Nevertheless, those graphs are also somehow interpreted as texts.

            That's why I would preprocess the image to get single boxes (the actual texts, the graphs, the information from the top, etc.), and filter to only store those boxes with the content of interest. That could be done by using

            • the y coordinate of the bounding rectangle (not in the upper 5 % of the image, that's the mobile phone status bar),
            • the width w of the bounding rectangle (not wider than 50 % of the image's width, these are the horizontal lines),
            • the x coordinate of the bounding rectangle (not in middle third of the image, these are the graphs).

            What's left is to run pytesseract on each cropped image with config='-psm 6' for example (assume a single uniform block of text), and clean the texts from any line breaks.

            That'd be my code:

            Source https://stackoverflow.com/questions/67187438
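
The three geometric rules listed in the answer above can be sketched as a small predicate. This is an illustration, not the answerer's code (which is elided here); the name `keep_box` and the sample boxes on a 900 x 1600 screenshot are made up.

```python
def keep_box(x, y, w, h, img_w, img_h):
    """Return True if a bounding rectangle passes all three filters."""
    if y < 0.05 * img_h:                  # upper 5 %: phone status bar
        return False
    if w > 0.5 * img_w:                   # wider than 50 %: horizontal lines
        return False
    if img_w / 3 <= x < 2 * img_w / 3:    # starts in the middle third: graphs
        return False
    return True


# Made-up boxes in (x, y, w, h) form on a 900 x 1600 image.
boxes = [
    (20, 10, 100, 30),     # status bar
    (0, 400, 880, 4),      # horizontal line
    (350, 500, 200, 150),  # graph
    (30, 500, 200, 40),    # text on the left
    (650, 500, 200, 40),   # text on the right
]
kept = [b for b in boxes if keep_box(*b, img_w=900, img_h=1600)]
print(kept)  # [(30, 500, 200, 40), (650, 500, 200, 40)]
```

In practice the boxes would come from something like `cv2.boundingRect(contour)`, which returns `(x, y, w, h)`, and each kept crop would go to `pytesseract.image_to_string`. Note that Tesseract 4 and later expect the option spelled `--psm`, so the config string would likely need to be `'--psm 6'` there.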

            QUESTION

            Performing OCR of Seven Segment Display images
            Asked 2021-Apr-19 at 06:01

            I'm working on performing OCR of energy meter displays: example 1 example 2 example 3

            I tried to use tesseract-ocr with the letsgodigital trained data. But the performance is very poor.

            I'm fairly new to the topic and this is what I've done:

            ...

            ANSWER

            Answered 2021-Apr-19 at 06:01

            Notice how your power meters either use blue or green LEDs to light up the display; I suggest you use this color display to your advantage. What I'd do is select only one RGB channel based on the LED color. Then I can threshold it based on some algorithm or assumption. After that, you can do the downstream steps of cropping / resizing / transformation / OCR etc.

            For example, using your example image 1, look at its histogram here. Notice how there is a small peak of green to the right of the 150 mark.

            I take advantage of this, and set anything below 150 to zero. My assumption being that the green peak is the bright green LED in the image.

            Source https://stackoverflow.com/questions/67146380
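
A minimal NumPy sketch of the channel selection described in the answer above: keep a single color channel and set everything below 150 to zero. The function name `isolate_channel` and the 2 x 2 sample image are hypothetical. Green sits at index 1 in both RGB and OpenCV's BGR ordering, so the default works for either.

```python
import numpy as np


def isolate_channel(image, channel=1, cutoff=150):
    """Keep one color channel and zero every value below `cutoff`."""
    single = image[:, :, channel].copy()  # copy so the input stays untouched
    single[single < cutoff] = 0
    return single


# Made-up 2 x 2 image: one bright-green pixel, one dim-green pixel.
img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = (10, 200, 30)
img[1, 1] = (10, 120, 30)
print(isolate_channel(img).tolist())  # [[200, 0], [0, 0]]
```

The cutoff of 150 comes from the histogram of one example image, so it would need to be re-derived for other displays or lighting.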

            QUESTION

            unable to get text from the image
            Asked 2021-Apr-18 at 06:54

            I'm learning AI/ML and trying to get text from this sample form.

            ...

            ANSWER

            Answered 2021-Apr-18 at 06:54

            This link provided the answer: it removes the noise in the background image.

            Source https://stackoverflow.com/questions/67106600

            QUESTION

            watchdog.observers.Observer works in Windows, works in docker on Linux, does not work in docker on Windows
            Asked 2021-Apr-10 at 01:11

            I have an interesting problem that is driving me nuts. I have a python program that is using watchdog.observers.Observer. This program (aka watcher) watches a folder and responds when files appear in it. I have another program (aka parser) which periodically populates the watched folder with files.

            1. When the watcher program runs in Windows and the parser runs in a docker container on Windows, there is happiness.
            2. When the watcher program runs in a docker container on a Linux box and the parser runs in another docker container on the Linux box, there is happiness.
            3. When the watcher program runs in a docker container on Windows and the parser runs in another docker container on Windows, happiness is not achieved. The parser populates the folder with files, but the watcher never observes them.

            Here's my watcher code:

            ...

            ANSWER

            Answered 2021-Apr-10 at 01:11

            The underlying API that watchdog uses to monitor Linux filesystem events is called inotify. The Docker for Windows WSL 2 backend documentation notes:

            Linux containers only receive file change events (“inotify events”) if the original files are stored in the Linux filesystem.

            The directory you're mounting, c:\My_MR, resides on the Windows file system and thus inotify inside the watcher container doesn't work.

            Instead, you can run docker from inside your WSL 2 default distribution with a Linux filesystem path, e.g., ~/my_mr:

            Source https://stackoverflow.com/questions/66909254
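
If moving the watched folder onto the Linux filesystem is not an option, polling is a different technique that avoids inotify altogether. The sketch below is standard-library only and is not watchdog's API; the helper names `list_files` and `new_files` and the file `report.txt` are hypothetical, and a temporary directory stands in for the watched folder.

```python
import os
import tempfile


def list_files(folder):
    """Return the set of file names currently present in `folder`."""
    return {entry.name for entry in os.scandir(folder) if entry.is_file()}


def new_files(previous, folder):
    """Return (names that appeared since `previous`, current snapshot)."""
    current = list_files(folder)
    return current - previous, current


# Demonstration against a temporary directory.
with tempfile.TemporaryDirectory() as watched:
    seen = list_files(watched)
    with open(os.path.join(watched, "report.txt"), "w") as fh:
        fh.write("data")
    added, seen = new_files(seen, watched)
    print(sorted(added))  # ['report.txt']
```

A real watcher would call `new_files` in a loop with a sleep between iterations. watchdog also ships a polling-based observer, `watchdog.observers.polling.PollingObserver`, which may be a simpler drop-in where comparing directory snapshots is acceptable.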

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install tesseract-ocr

            You can download it from GitHub.

            Support

            For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.
            CLONE
          • HTTPS

            https://github.com/jimregan/tesseract-ocr.git

          • CLI

            gh repo clone jimregan/tesseract-ocr

          • sshUrl

            git@github.com:jimregan/tesseract-ocr.git
