textricator | extract text from documents and generate structured data | Regex library
kandi X-RAY | textricator Summary
Textricator is a tool to extract text from documents and generate structured data.
Community Discussions
Trending Discussions on textricator
QUESTION
I'm trying to use the PDF document parser called Textricator. It can parse a PDF with three different methods, using some common PDF libraries (itext5, itext7, pdfbox). The available methods are text, table, and form: text for plain raw text extraction, table for reading structured tabular data, and form for parsing less structured forms using a finite state machine (FSM).
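To make the finite-state-machine idea concrete, here is a minimal generic sketch in Python. It is not Textricator's implementation and does not use its YAML configuration format; the state names (`start`, `label`, `value`), the `is_label`/`is_value` conditions, and the sample segments are all invented for illustration. It only shows the general pattern: each text segment triggers the first transition from the current state whose condition matches, and the state entered decides how the segment is stored.

```python
from typing import Callable, Dict, List, Tuple

# A transition is (next_state, condition). All names here are hypothetical.
Transition = Tuple[str, Callable[[str], bool]]


def is_label(text: str) -> bool:
    """Toy condition: a segment ending in ':' is treated as a field label."""
    return text.endswith(":")


def is_value(text: str) -> bool:
    """Toy condition: anything that is not a label is a value."""
    return not is_label(text)


# For each state, the transitions allowed out of it, tried in order.
MACHINE: Dict[str, List[Transition]] = {
    "start": [("label", is_label)],
    "label": [("value", is_value)],
    "value": [("label", is_label), ("value", is_value)],
}


def parse(segments: List[str]) -> Dict[str, str]:
    """Walk the segments through the machine, pairing each label with its values."""
    state = "start"
    fields: Dict[str, str] = {}
    current_label = ""
    for text in segments:
        for next_state, condition in MACHINE[state]:
            if condition(text):
                state = next_state
                break
        else:
            # No allowed transition matched: the input does not fit the machine.
            raise ValueError(f"no transition from {state!r} for {text!r}")
        if state == "label":
            current_label = text.rstrip(":")
            fields[current_label] = ""
        else:
            # state == "value": append to the field opened by the last label.
            fields[current_label] = (fields[current_label] + " " + text).strip()
    return fields


if __name__ == "__main__":
    print(parse(["Name:", "Ada", "Lovelace", "Role:", "Teacher"]))
```

A segment that matches no transition from the current state raises an error, which is the situation a form configuration has to avoid by covering every state the document can reach.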
However, I am not able to use the form parser. Perhaps I simply don't understand how to organize the many configuration states. The documentation lacks a simple form example. Someone recently posted an attempt to read a very basic table using the form method, but could not get it to work. I also gave it a shot, without success.
Q: Can someone help me configure the state machine in the YML file?
(This is used to parse the demo file from one of that repo's issues, and shown in the copied screenshot below.)
The YML configuration file.
...
ANSWER
Answered 2021-May-17 at 18:42
As Textricator is kind of a hidden gem for PDF parsing, in my opinion, I'm happy to see someone using it. I posted a config that works with the sample document to the GitHub issue.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install textricator
Download the latest build of Textricator from https://repo1.maven.org/maven2/io/mfj/textricator/ by opening the directory for the latest version and downloading textricator-VERSION-bin.tgz (or textricator-VERSION-bin.zip for Windows).
Extract it.
Open a shell. The following examples start with ./textricator; on Windows, use .\textricator.bat instead.
  Windows: run Windows PowerShell (it should be in the Start menu).
  macOS: run Terminal (type "terminal" in Spotlight).
Show help:
  ./textricator --help
Download the example files to the textricator directory:
  https://github.com/measuresforjustice/textricator/blob/main/src/test/resources/io/mfj/textricator/examples/school-employee-list.pdf
  https://github.com/measuresforjustice/textricator/blob/main/src/test/resources/io/mfj/textricator/examples/school-employee-list.yml
Extract raw text from a PDF to standard out:
  ./textricator text --input-format=pdf.pdfbox school-employee-list.pdf
Parse a PDF to CSV:
  ./textricator form --config=school-employee-list.yml school-employee-list.pdf school-employee-list.csv
This uses the configuration file school-employee-list.yml to parse school-employee-list.pdf. To parse your own PDF form, you will need to write your own configuration file. See the Form section for details. If your PDF has a tabular layout, see the Table section.
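If several PDFs share one layout, the form command shown above can be scripted. The following is a minimal sketch in Python, assuming it runs from the extracted textricator directory and that every PDF in a folder can be parsed with the same configuration file. The helper names (`launcher`, `form_command`, `convert_all`) are hypothetical; only the `form` subcommand, the `--config=` option, and the two launcher names come from the steps above.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def launcher() -> str:
    """Pick the launcher named in the install steps for the current platform."""
    return r".\textricator.bat" if sys.platform.startswith("win") else "./textricator"


def form_command(config: Path, pdf: Path) -> List[str]:
    """Build the `form` invocation; the CSV is written next to the PDF."""
    return [
        launcher(),
        "form",
        f"--config={config}",
        str(pdf),
        str(pdf.with_suffix(".csv")),
    ]


def convert_all(config: Path, folder: Path) -> None:
    """Run the form parser on every PDF in `folder`, stopping at the first failure."""
    for pdf in sorted(folder.glob("*.pdf")):
        subprocess.run(form_command(config, pdf), check=True)
```

Calling `convert_all(Path("school-employee-list.yml"), Path("."))` would run the same command as the last step for each PDF in the current directory. Building the argument list separately from running it makes it easy to print and inspect a command before executing anything.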