ocropus | OSX-buildable fork
kandi X-RAY | ocropus Summary
OSX-buildable fork of ocropus 0.4
Community Discussions
Trending Discussions on ocropus
QUESTION
I am trying to use the following OCR project, which is found here on GitHub. I am using a Python 3 virtual environment on Windows. I successfully installed requirements.txt using Python 3.6.7; however, when I attempt to run python install setup.py, I get the following error:
ANSWER
Answered 2020-May-19 at 18:21
Read your error again, and you will see this on the second line of your error:
QUESTION
I'd like to get the coordinates of all areas containing any text in scans of documents like the one shown below (in reduced quality; the original files are of high resolution):
I'm looking for something similar to these (GIMP'ed-up!) bounding boxes. It's important to me that the paragraphs be recognized as such. If the two big blocks (top box on left page, center block on right page) would get two bounding boxes each, though, that would be fine:
The way of obtaining these bounding box coordinates could be through some kind of API (scripted languages preferred over compiled ones) or through a command line command, I don't care. What's important is that I get the coordinates themselves, not just a modified version of the image where they're visible. The reason for that is that I need to calculate the area size of each one of them and then cut out a piece at the center of the largest.
What I've already tried, so far without success:
- ImageMagick - it's just not meant for such a task
- OpenCV - either the learning curve is too steep or my google-fu too bad
- Tesseract - from what I've been able to garner, it's the one-off OCR software that, for historical reasons, doesn't do Page Layout Analysis before attempting character shape recognition
- OCRopus/OCRopy - should be able to do it, but I'm not finding out how to tell it I'm interested in paragraphs as opposed to words or characters
- Kraken ibn OCRopus - a fork of OCRopus with some rough edges, still fighting with it
- Using statistics, specifically, a clustering algorithm (OPTICS seems to be the one most appropriate for this task) after binarization of the image - both my maths and coding skills are insufficient for it
I've seen images around the internet of document scans being segmented into parts containing text, photos, and other elements, so this problem seems to be one that has academically already been solved. How to get to the goodies, though?
ANSWER
Answered 2019-Jun-21 at 23:56
In ImageMagick, you can threshold the image to suppress noise, then blur it and threshold again so that large regions of black become connected. Then use -connected-components to filter out small regions, especially white ones, and find the bounding boxes of the remaining black regions. (Unix bash syntax)
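The answer's bash commands are not reproduced on this page. As a rough illustration of the last two steps only (dropping small regions and reading off bounding boxes), here is a minimal pure-Python sketch. It is not ImageMagick's implementation: it assumes the image has already been thresholded and blurred into a binary grid where truthy cells are "ink", and the names text_region_boxes, center_crop_of_largest, and min_area are hypothetical helpers introduced for this example.

```python
from collections import deque


def text_region_boxes(mask, min_area=4):
    """Return (x, y, width, height) boxes for 4-connected regions of truthy cells.

    `mask` is a list of rows (assumed already thresholded and blurred).
    Regions with fewer than `min_area` pixels are dropped as noise.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill over one connected region.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            min_x = max_x = x0
            min_y = max_y = y0
            pixels = 0
            while queue:
                x, y = queue.popleft()
                pixels += 1
                min_x, max_x = min(min_x, x), max(max_x, x)
                min_y, max_y = min(min_y, y), max(max_y, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if pixels >= min_area:
                boxes.append((min_x, min_y, max_x - min_x + 1, max_y - min_y + 1))
    return boxes


def center_crop_of_largest(boxes, crop_w, crop_h):
    """Return the (x, y, w, h) crop centred in the box with the largest area."""
    x, y, w, h = max(boxes, key=lambda b: b[2] * b[3])
    # Never let the crop exceed the box it is cut from.
    crop_w, crop_h = min(crop_w, w), min(crop_h, h)
    return (x + (w - crop_w) // 2, y + (h - crop_h) // 2, crop_w, crop_h)
```

Because the boxes come back as plain coordinates rather than as a drawn-over image, the area of each one is simply width times height, which is what the question needs in order to pick the largest block and cut a piece from its center.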
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported