textract | js module for extracting text | Runtime Evironment library

by dbashford HTML Version: v2.5.0 License: MIT

X-Ray Key Features Code Snippets Community Discussions(10)Vulnerabilities Install Support

kandi X-RAY | textract Summary

textract is a HTML library typically used in Server, Runtime Evironment, Nodejs applications. textract has no bugs, it has a Permissive License and it has medium support. However textract has 1 vulnerabilities. You can download it from GitHub.

A text extraction node module.

Support

Quality

Security

License

Reuse

Support

textract has a medium active ecosystem.

It has 1554 star(s) with 180 fork(s). There are 42 watchers for this library.

It had no major release in the last 6 months.

There are 53 open issues and 115 have been closed. On average issues are closed in 163 days. There are 13 open pull requests and 0 closed requests.

It has a neutral sentiment in the developer community.

The latest version of textract is v2.5.0

Quality

textract has no bugs reported.

Security

textract has 1 vulnerability issues reported (0 critical, 1 high, 0 medium, 0 low).

License

textract is licensed under the MIT License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

textract releases are not available. You will need to build from source code and install.

Installation instructions are not available. Examples and code snippets are available.

Top functions reviewed by kandi - BETA

kandi's functional review helps you automatically verify the functionalities of the libraries and avoid rework.
Currently covering the most popular Java, JavaScript and Python libraries. See a Sample of textract

Get all kandi verified functions for this library.

textract Key Features

No Key Features are available at this moment for textract.

textract Examples and Code Snippets

No Code Snippets are available at this moment for textract.

Community Discussions

Trending Discussions on textract

Analyzing a Specific Page of a PDF with Amazon Textract

How to get the font style from an image with text?

Antiword can't open 'C:\\?????? ????????\\info.doc' for reading in Windows

Amazon Textract JSON missing some pages

textract_python_table_parser.py command prompt lacking credentials

Parsing multipage tables into CSV files with AWS Textract

Not able to retrieve processed file from S3 Bucket

Not able to access file in S3Bucket

AWS Collaboration on a Project

Converting multiple files in a directory into .txt format. But file names become Binary

QUESTION

Analyzing a Specific Page of a PDF with Amazon Textract

Asked 2021-May-28 at 03:55

I am using Amazon Textract to extract text from PDF files. For some of these documents, I want to be able to specify the pages from which data is to be extracted, rather than having to go through the entire thing. Is this possible? If so, how do I do it? I cannot seem to find an answer in the docs.

...

ANSWER

Answered 2021-May-28 at 03:55

I do not believe Textract offers this feature, but you can easily implement it programatically. Since your tags mention python, I'll suggest a way to do this using python. You can use a library like PyPDF2 which lets you specify which pages you want to extract and creates a new pdf with just those pages.

Source https://stackoverflow.com/questions/67724826

QUESTION

How to get the font style from an image with text?

Asked 2021-May-18 at 14:40

I am using the Amazon Textract API, through AWS' Python API, to extract text from a document (pdf or jpg). I do get the text and coordinates of its bounding box, but I would also love to have the font type (only the major ones needed: Arial, Helvetica, Verdana, Calibri, Times New Roman + a few others).

Does anyone have a solution to get that piece of data?

The best solution may be a package, which accepts a small image, returns the font type name, and which I can run on my server. An external API would most likely be too costly (money and time-wise), as I have to run it 100+ times in a second.

What Amazon Textract returns (unfortunately, no font type): ...

ANSWER

Answered 2021-May-18 at 14:40

At the moment the Amazon Textract does not support font recognition. These two projects might help you:

DeepFont: Identify Your Font from An Image

Paper: https://arxiv.org/pdf/1507.03196v1.pdf
GitHub: https://github.com/robinreni96/Font_Recognition-DeepFont

Typefont: The first open-source library that detects the font of a text in a image.(It's read only now.)

GitHub: https://github.com/Vasile-Peste/Typefont

Source https://stackoverflow.com/questions/67548410

QUESTION

Antiword can't open 'C:\\?????? ????????\\info.doc' for reading in Windows

Asked 2021-May-11 at 14:28

Description

I am using texttract python library to extract word document text. The problem is that: if the path contains arabic characters, then, antiword outputs that can't read the document.

Example ...

ANSWER

Answered 2021-May-11 at 14:28

After digging into the source code of textract, it becomes clear that for extraction from .doc the (ancient) command line tool antiword is used.

Source https://stackoverflow.com/questions/67469730

QUESTION

Amazon Textract JSON missing some pages

Asked 2021-Apr-23 at 16:32

I am using amazon textract to analyse pdf documents using the async APIs of amazon textract. After I perform the operations, in some cases the output Textract JSON is missing a few pages. What is the reason for missing a few files?

Ex: In this document, it has 4 pages.

But the extraction information is only available for 2 pages.

This is the document information

...

ANSWER

Answered 2021-Apr-23 at 16:32

It's the NextToken . When NextToken is populated you need to make another call to get the next segment of results. When NextToken is null, you have all the results.

I'm using CLI but

aws textract get-document-analysis --next-token FLpA6... --job-id 12345....

Source https://stackoverflow.com/questions/66108188

QUESTION

textract_python_table_parser.py command prompt lacking credentials

Asked 2021-Apr-14 at 22:25

I'm trying to put to work AWS's Textract export table suggestion in this link I'm a complete newbie in AWS's solutions and in command prompt so I'm trying to do exactly as they suggest. I'm running that in python so I'm using this piece of code:

...

ANSWER

Answered 2021-Apr-14 at 22:25

It depends where you run your code, for example:

local computer - can use aws configure CLI to set your credetnails
EC2 instance - use instance role
lambda function - use lambda execution role

Source https://stackoverflow.com/questions/67091278

QUESTION

Parsing multipage tables into CSV files with AWS Textract

Asked 2021-Apr-14 at 18:46

I'm a total AWS newbie trying to parse tables of multi page files into CSV files with AWS Textract. I tried using AWS's example in this page however when we are dealing with a multi-page file the response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES']) breaks since we need asynchronous processing in those cases, as you can see in the documentation here. The correct function to call would be client.start_document_analysis and after running it retrieve the file using client.get_document_analysis(JobId).

So, I adapted their example using this logic instead of using client.analyze_document function, the adapted piece of code looks like this:

...

ANSWER

Answered 2021-Apr-14 at 18:46

Consider use two different lambdas. One for call textract and one for process the result.

Please read this document

https://aws.amazon.com/blogs/compute/getting-started-with-rpa-using-aws-step-functions-and-amazon-textract/

And check this repository

https://github.com/aws-samples/aws-step-functions-rpa

To process the JSON you can use this sample as reference https://github.com/aws-samples/amazon-textract-response-parser or use it directly as library.

Source https://stackoverflow.com/questions/67097279

QUESTION

Not able to retrieve processed file from S3 Bucket

Asked 2021-Apr-13 at 04:39

I'm an AWS newbie trying to use Textract API, their OCR service. As far as I understood I need to upload files to a S3 bucket and then run textract on it.

I got the bucket on and the file inside it:

I got the permissions:

But when I run my code it bugs.

...

ANSWER

Answered 2021-Apr-12 at 13:40

Amazon Textract currently supports PNG, JPEG, and PDF formats. Looks like you are using PDF.

Once you have a valid format, you can use the Python S3 API to read the data of the object in the S3 object. Once you read the object, you can pass the byte array to the analyze_document method. TO see a full example of how to use the AWS SDK for Python (Boto3) with Amazon Textract to detect text, form, and table elements in document images.

https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/python/example_code/textract/textract_wrapper.py

Try following that code example to see if your issue is resolved.

"Could you provide some clearance on the params to use"

I just ran the Java V2 example and it works perfecly. In this example, i am using a PNG file located in a specific Amazon S3 bucket.

Here are the parameters that you need:

Make sure when implementing this in Python, you set the same parameters.

Source https://stackoverflow.com/questions/67045414

QUESTION

Not able to access file in S3Bucket

Asked 2021-Apr-11 at 12:59

I'm a total newbie in AWS trying to use Textract. Right now I have a file in the S3Bucket and I'm trying to process it through the API. I'm using the following code (I'm running it in Python):

...

ANSWER

Answered 2021-Apr-10 at 21:53

The S3 object url will only work if the object is public. To download it through S3 console you have to Download it using S3 object menu for that:

Source https://stackoverflow.com/questions/67039475

QUESTION

AWS Collaboration on a Project

Asked 2021-Mar-10 at 00:45

So, I am working on a project where I will need to use tons of services that AWS has to offer like S3, EC2, Route53, Textract, RDS and some more. However, during the course of the project, I am going to be collaborating on the project with my team.

I know I would set up an IAM User for everyone on the team. But, how do I assure that everyone has access to the services mentioned above so that we can work on the project together?

...

ANSWER

Answered 2021-Mar-10 at 00:45

Great question. Make sure you have one root account and enable MFA on it. Then go to IAM and create a user yourself. Try to change the 12321312.signin.aws.com link to yourTeamName.signin.aws.com for easy logins too.

Once you create a user for yourself, go to the side on IAM and click groups. Create a new group called "admins". Click next and then attach the administratorAccess policy. AWS Img 1 Then click next. Then save it. Then click on your group, and then click "add users to group" and select the user account for yourself that you just made.

Now, log out of the root account and log into your name@projectNameRootAccount and you can do everything that you did on your root. Only difference is you are not coding on your root and it is safe from hackers.

Next.... to make it extremely simple.... You could just create a user under IAM for each of them and add them to your admins group.

Best practice would be to give them less privileges but if it is an informal thing.... not a big deal. I would consider clicking on their IAM names from time to time and click on the access advisor tab to see what services they are actually using. If they are only using "dev" type things, you can see that pretty clearly within a week or so and then you can go and create a new group for "devs" and then just give them the policies you see under "access advisor (when you click on their IAM name)".

Then you can take them out of the admins group and give them the access that they truly need.

Thanks and good luck with AWS. It can be complicated haha.

Source https://stackoverflow.com/questions/66544849

QUESTION

Converting multiple files in a directory into .txt format. But file names become Binary

Asked 2021-Feb-20 at 00:43

So I am creating plagiarism software, for that, I need to convert .pdf, .docx,[enter image description here][1] etc files into a .txt format. I successfully found a way to convert all the files in one directory to another. BUT the problem is, this method is changing the file names

into binary values. I need to get the original file name which I am gonna need in the next phase.

...

ANSWER

Answered 2021-Feb-19 at 13:11

remove this line where you rename the files:

Source https://stackoverflow.com/questions/66277995

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install textract

You can download it from GitHub.

Support

For any new features, suggestions and bugs create an issue on GitHub. If you have any questions check and ask questions on community page Stack Overflow .

Find more information at: