pdf2docx | Open source Python library converting pdf to docx | File Utils library

by dothinking | Python | Version: 0.5.6 | License: GPL-3.0

kandi X-RAY | pdf2docx Summary

pdf2docx is a Python library typically used in Utilities and File Utils applications. It has no reported bugs or vulnerabilities, a build file is available, it carries a Strong Copyleft license, and it has medium support. You can install it with 'pip install pdf2docx' or download it from GitHub or PyPI.

Parse PDF file with PyMuPDF and generate docx with python-docx

Support

pdf2docx has a medium active ecosystem.

It has 1432 star(s) with 221 fork(s). There are 19 watchers for this library.

It had no major release in the last 12 months.

There are 45 open issues and 112 have been closed. On average issues are closed in 45 days. There are 2 open pull requests and 0 closed requests.

It has a neutral sentiment in the developer community.

The latest version of pdf2docx is 0.5.6

Quality

pdf2docx has 0 bugs and 0 code smells.

Security

pdf2docx has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

pdf2docx code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

pdf2docx is licensed under the GPL-3.0 License. This license is Strong Copyleft.

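Strong Copyleft licenses require derivative works that you distribute to be released under the same license terms, so check compatibility before using pdf2docx in a proprietary or differently licensed project.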

Reuse

pdf2docx releases are available to install and integrate.

Deployable package is available in PyPI.

Build file is available. You can build the component from source.

pdf2docx saves you 1752 person hours of effort in developing the same functionality from scratch.

It has 4631 lines of code, 521 functions and 56 files.

It has high code complexity. Code complexity directly impacts maintainability of the code.

Top functions reviewed by kandi - BETA

kandi has reviewed pdf2docx and discovered the below as its top functions. This is intended to give you an instant insight into pdf2docx implemented functionality, and help decide if they suit your requirements.

Parse a pdf document
Parse raw pages
Reset the list of instances
Calculate margin
Group horizontal borders
Sort the images in line order
Sort the instances in reading order
Make a docx file
Lower a number
Convert a PDF file to a docx file
Assign shapes to given tables
Draw the path
Clean up shapes outside of the page
Extract fonts from a fitz document
Check if the given shape is in text format
Restore blocks from raw blocks
Decorator to plot objects
Return the semantic type of a line
Make docx for this cell
Create docx for each section
Callback called when PDF files are converted to docx folder
Updates the font with the given font
Parse horizontal spacing
Cleans up blank lines
Set the border of the cell
Parse PDF files per CPU

pdf2docx Key Features

No Key Features are available at this moment for pdf2docx.

pdf2docx Examples and Code Snippets

Convert pdf files to docx in python

Python

Lines of Code : 14

License : Strong Copyleft (CC BY-SA 4.0)

from pdf2docx import Converter

pdf_file = '/path/to/sample.pdf'
docx_file = '/path/to/sample.docx'

# Convert the PDF to docx; start=0 and end=None cover every page
cv = Converter(pdf_file)
cv.convert(docx_file, start=0, end=None)
cv.close()  # release the underlying PDF document

from pdf2docx imp
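
A slightly more defensive variant of the snippet above, offered as a minimal sketch: it derives the output name from the input and closes the converter even if the conversion fails. The helper names docx_path_for and convert_pdf are illustrative and are not part of pdf2docx; only Converter, convert and close come from the library.

```python
from contextlib import closing
from pathlib import Path


def docx_path_for(pdf_path):
    """Return the sibling .docx path for a PDF path (hypothetical helper)."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pdf(pdf_path, docx_path=None, start=0, end=None):
    """Convert pdf_path to docx and return the output path (hypothetical helper)."""
    # Imported here so docx_path_for stays usable without pdf2docx installed.
    from pdf2docx import Converter

    out = Path(docx_path) if docx_path else docx_path_for(pdf_path)
    # closing() calls cv.close() even when convert() raises.
    with closing(Converter(str(pdf_path))) as cv:
        cv.convert(str(out), start=start, end=end)
    return out
```

Calling convert_pdf('/path/to/sample.pdf') would then write the result next to the input as sample.docx.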

Community Discussions

Trending Discussions on pdf2docx

python-docx add title in a 2 column layout document

why is python-docx returning cells with text when should be empty?

QUESTION

python-docx add title in a 2 column layout document

Asked 2021-Nov-12 at 14:02

I've been going through the python-docx docs and couldn't find a way to insert a title in a 2 column layout document.

I've tried several workarounds and none of them worked. Whenever I create a 2 column layout using python-docx and try to add a title for the document, the title is not added at the top center of the document; it is added to the first column on the left instead.

Below is the code that I am using to generate the document.

...

ANSWER

Answered 2021-Nov-09 at 18:45

You'll need separate sections for the (1-col) title and the (2-col) body. There is a setting on a section to specify the kind of break that precedes it, something like section.start_type = WD_SECTION.CONTINUOUS. I believe that will need to go on the second section.

Source https://stackoverflow.com/questions/69896636
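
The answer above can be sketched roughly as follows. This is an illustration under assumptions, not the answerer's code: the function name build_two_column_document is hypothetical, and because python-docx exposes no public setting for the column count, the sketch edits the section's w:cols element through the private _sectPr attribute, which may change between library versions.

```python
def build_two_column_document(title, paragraphs):
    """Build a document with a full-width title over a two-column body."""
    # Deferred imports: python-docx is only needed when the function runs.
    from docx import Document
    from docx.enum.section import WD_SECTION
    from docx.oxml import OxmlElement
    from docx.oxml.ns import qn

    document = Document()
    # The title stays in the first section, which keeps a single column.
    document.add_heading(title, level=0)

    # The second section continues on the same page instead of a new one.
    body = document.add_section(WD_SECTION.CONTINUOUS)

    # Assumption: set the column count directly on the <w:cols> element.
    sect_pr = body._sectPr
    cols = sect_pr.find(qn("w:cols"))
    if cols is None:
        cols = OxmlElement("w:cols")
        sect_pr.append(cols)
    cols.set(qn("w:num"), "2")

    for text in paragraphs:
        document.add_paragraph(text)
    return document
```

The returned document can be written out with its save method, for example build_two_column_document('Title', ['First paragraph']).save('two_columns.docx').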

QUESTION

why is python-docx returning cells with text when should be empty?

Asked 2021-Oct-04 at 20:51

I have a docx document converted from PDF with the pdf2docx library. The result looks good, but when I load the docx document with python-docx, it produces a table whose cells contain text where they should be empty. Those cells are filled with the text of the cell one row above.

The table looks like this:

The table contains three rows. The first row should contain cells with the values [Barriere, Bonuslevel, Cap, Beobachtungszeitraum, Anfangl], and the second and third rows should be empty except for the last column. But in the debugger I can see that the empty cells contain text values like this:

The text Basiswert is in the first cell and in the sixth cell. The sixth cell should be empty. I opened the XML of the docx document and everything looks fine there, so I think the problem is in the python-docx library. Has anyone else had the same problem?

Edit: This article is very helpful:

https://python-docx.readthedocs.io/en/latest/dev/analysis/features/table/cell-merge.html

Basically, the copied cells are continuation cells, which indicates that cells are merged into horizontal or vertical spans, but I still don't know how to read this information from the python-docx API.

...

ANSWER

Answered 2021-Oct-04 at 17:24

The addressing of table cells in python-docx is based on the grid layout. Basically the grid is all the cells before any cell merging is done. In the grid layout there are n rows and m columns and m * n cells; each row-column combination/intersection has a cell.

When you address a grid cell that is "merged" into some other cell, then the top-left member of the merged (rectangular) region is returned.

This means that some content is returned more than once if the table includes merged cells.

Source https://stackoverflow.com/questions/69436958
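
One way to act on the answer above is to report each merged region only once. The sketch below assumes that grid positions belonging to the same merged cell share the same underlying element, reachable through the private _tc attribute of a python-docx cell; that attribute is not public API and may change. The function name unique_cell_texts is hypothetical.

```python
def unique_cell_texts(table):
    """Return the text of every grid cell, blanking repeats of merged cells."""
    seen = []  # keep references alive so identity comparisons stay valid
    rows = []
    for row in table.rows:
        texts = []
        for cell in row.cells:
            tc = cell._tc  # private: the element backing this grid position
            if any(tc is earlier for earlier in seen):
                # Same element as an earlier position: a merged-cell repeat.
                texts.append("")
            else:
                seen.append(tc)
                texts.append(cell.text)
        rows.append(texts)
    return rows
```

With a table loaded through python-docx, unique_cell_texts(document.tables[0]) would keep the text at the first position of each merged region and return empty strings for the repeated positions.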

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install pdf2docx

You can install it with 'pip install pdf2docx' or download it from GitHub or PyPI.
You can use pdf2docx like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.
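
The virtual environment advice above can be sketched with the standard library alone. This is one possible approach, not an official install procedure: the function names venv_python and install_pdf2docx and the default directory .venv are assumptions made for illustration, and the install step needs network access.

```python
import subprocess
import sys
import venv
from pathlib import Path


def venv_python(env_dir):
    """Return the interpreter path inside a virtual environment directory."""
    env_dir = Path(env_dir)
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_pdf2docx(env_dir=".venv"):
    """Create a virtual environment and install pdf2docx into it."""
    venv.create(env_dir, with_pip=True)
    python = str(venv_python(env_dir))
    # Bring the packaging tools up to date first, as recommended above.
    subprocess.run(
        [python, "-m", "pip", "install", "--upgrade", "pip", "setuptools", "wheel"],
        check=True,
    )
    subprocess.run([python, "-m", "pip", "install", "pdf2docx"], check=True)
```

Running install_pdf2docx() once leaves the system Python untouched; scripts that use pdf2docx are then run with the interpreter returned by venv_python('.venv').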