pdfx | Extract text, metadata and references (pdf, url, doi) | Document Editor library

by metachris | Language: Python | Version: 1.4.1 | License: Apache-2.0

kandi X-RAY | pdfx Summary

pdfx is a Python library typically used in Editor and Document Editor applications. It has no reported bugs or vulnerabilities, a build file, a permissive license, and medium support. You can install it with 'pip install pdfx' or download it from GitHub or PyPI.

Extract references (pdf, url, doi, arxiv) and metadata from a PDF. Optionally download all referenced PDFs and check for broken links.

Support

pdfx has a medium active ecosystem.

It has 966 star(s) with 112 fork(s). There are 38 watchers for this library.

It had no major release in the last 12 months.

There are 20 open issues and 23 have been closed. On average issues are closed in 713 days. There are 5 open pull requests and 0 closed requests.

It has a neutral sentiment in the developer community.

The latest version of pdfx is 1.4.1

Quality

pdfx has 0 bugs and 0 code smells.

Security

pdfx has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

pdfx code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

pdfx is licensed under the Apache-2.0 License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

pdfx releases are available to install and integrate.

Deployable package is available in PyPI.

Build file is available. You can build the component from source.

Installation instructions, examples and code snippets are available.

It has 796 lines of code, 54 functions and 13 files.

It has medium code complexity. Code complexity directly impacts maintainability of the code.

Top functions reviewed by kandi - BETA

kandi has reviewed pdfx and discovered the below as its top functions. This is intended to give you an instant insight into pdfx implemented functionality, and help decide if they suit your requirements.

Create argument parser
Check URLs
Return the status code of a given URL
Sanitize url
Wait for all tasks to finish
Color print a string
Download PDFs to specified directory
Download one or more URLs
Return a list of all references
Return the text output of a PDF
Return the number of references in the file
Return a dictionary containing the meta-information of the description
Parses the value element
Parse tag element
Print text to the console
Extract the arXiv
Parse a requirements file
Cleanup metadata for all keys
Remove trailing whitespace
Wait for completion
Print an error code to stderr

pdfx Key Features

No Key Features are available at this moment for pdfx.

pdfx Examples and Code Snippets

No Code Snippets are available at this moment for pdfx.

Community Discussions

Trending Discussions on pdfx

Extract Hyperlink from a spool pdf file in Python

PDF to XML conversion using PDFX http://pdfx.cs.man.ac.uk/

Convert in R a XML with ASCII Entity Names to a basic XML

Getting an error from cete Dynamic pdf: ceTe.DynamicPDF.Imaging.ImageParsingException: TIFF Compression value 8 (Flate/Deflate) is not supported

How do you skip over files with no extension when downloading them?

How do you correctly parse web links to avoid a 403 error when using Wget?

Formatting a table based on Criteria in R

Ghostscript pdf conversion makes ligatures unable to copy & paste

QUESTION

Extract Hyperlink from a spool pdf file in Python

Asked 2021-Nov-13 at 19:39

I am getting my form data from the frontend and reading it using FastAPI as shown below:

...

ANSWER

Answered 2021-Nov-13 at 19:39

FastAPI gives you a SpooledTemporaryFile. You may be able to use that file object directly if pdfx has an API that works on a file object rather than on a str representing a path. Otherwise, make a new temporary file on disk and work with that:
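A minimal sketch of the second option, assuming the upload is available as a file-like object (for FastAPI, the `file` attribute of an `UploadFile`). The helper names `spool_to_disk` and `extract_references` are hypothetical; `pdfx.PDFx(path)` and `get_references_as_dict()` are the calls shown in the pdfx README, imported lazily here so the copy step works without pdfx installed.

```python
import shutil
import tempfile


def spool_to_disk(fileobj, suffix=".pdf"):
    """Copy a file-like object (e.g. a SpooledTemporaryFile) to a named file on disk.

    Returns the path of the new file; the caller is responsible for deleting it.
    """
    fileobj.seek(0)  # rewind in case the upload was already read
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(fileobj, tmp)
        return tmp.name


def extract_references(path):
    """Run pdfx on a PDF path and return its references grouped by type."""
    import pdfx  # imported here so spool_to_disk works without pdfx installed

    return pdfx.PDFx(path).get_references_as_dict()
```

In a request handler this would be used as `path = spool_to_disk(upload.file)` followed by `extract_references(path)`, with the temporary file removed afterwards.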

Source https://stackoverflow.com/questions/69956921

QUESTION

PDF to XML conversion using PDFX http://pdfx.cs.man.ac.uk/

Asked 2021-Sep-15 at 15:21

I'm aware that PDFX is a rule-based system designed to reconstruct the logical structure of scholarly articles in PDF form, regardless of their formatting style. The system's output is an XML document that describes the input article's logical structure in terms of title, sections, tables, references, etc.

I've been trying to convert some PDF files into XML using PDFX in Python, but http://pdfx.cs.man.ac.uk/ is not responding.

The code I use for the conversion is:

response = requests.post('http://pdfx.cs.man.ac.uk/', headers=headers, data=data)

Is it still available? Is there any other option to convert the documents reconstructing the structure of scholarly articles?

Thanks in advance!

...

ANSWER

Answered 2021-Sep-15 at 15:21

From the research I've been doing these days, I found a similar tool called GROBID.

Home page: https://grobid.readthedocs.io/en/latest/

GitHub: https://github.com/kermitt2/grobid

It is machine learning software for extracting information from scholarly documents.

Source https://stackoverflow.com/questions/68942805

QUESTION

Convert in R a XML with ASCII Entity Names to a basic XML

Asked 2021-May-30 at 21:50

I have the following XML file:

...

ANSWER

Answered 2021-May-30 at 16:16

The solution is a multi-step process: extract the database node, convert it to text, clean it up, and then convert it back to XML with the read_xml() function.

Source https://stackoverflow.com/questions/67762928

QUESTION

Getting an error from cete Dynamic pdf: ceTe.DynamicPDF.Imaging.ImageParsingException: TIFF Compression value 8 (Flate/Deflate) is not supported

Asked 2021-Apr-26 at 15:14

I'm getting this error from ceTe DynamicPDF. Any help would be appreciated; I have posted it on their forum but have not received a response. It happens when I call the Draw method to create a PDF after merging several documents into one. Is there a workaround for it?

Stacktrace follows:

Traceback (most recent call last):
  File "(unknown)", line 235, in ceTe.DynamicPDF.Imaging.TiffImageData.a()
  File "(unknown)", line 19, in ceTe.DynamicPDF.Imaging.TiffImageData.c()
  File "(unknown)", line unknown, in ceTe.DynamicPDF.Imaging.TiffImageData.Draw(ceTe.DynamicPDF.IO.OperatorWriter writer, System.Single pdfX, System.Single pdfY, System.Single width, System.Single height)
  File "(unknown)", line 617, in ceTe.DynamicPDF.PageElements.Image.DrawRotated(ceTe.DynamicPDF.IO.PageWriter writer)
  File "(unknown)", line 103, in ceTe.DynamicPDF.PageElements.RotatingPageElement.Draw(ceTe.DynamicPDF.IO.PageWriter writer)
  File "(unknown)", line unknown, in ceTe.DynamicPDF.PageElements.Group.Draw(ceTe.DynamicPDF.IO.PageWriter writer)
  File "(unknown)", line 14, in ceTe.DynamicPDF.Page.b(ceTe.DynamicPDF.IO.PageWriter A_0)
  File "(unknown)", line 136, in ceTe.DynamicPDF.Page.fd(ceTe.DynamicPDF.IO.DocumentWriter A_0, System.Int32 A_1, System.Int32 A_2)
  File "(unknown)", line 178, in ceTe.DynamicPDF.Page.a(ceTe.DynamicPDF.IO.DocumentWriter A_0, System.Int32 A_1, System.Int32 A_2, System.Int32 A_3)
  File "(unknown)", line 166, in zz93.b1.f()
  File "(unknown)", line 419, in zz93.b1.Draw()
  File "(unknown)", line 1, in ceTe.DynamicPDF.Document.Draw(System.IO.Stream stream)
  File "(unknown)", line 15, in ceTe.DynamicPDF.Document.Draw()
  File "C:\Users\Simon\Source\Workspaces\PubManager\PubManager\PdfManager\Printing.cs", line 1505, in PubManager.PdfManager.Printing+d__3.MoveNext()
  File "(unknown)", line 12, in System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
  File "(unknown)", line 40, in System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task task)
  File "C:\Users\Simon\Source\Workspaces\PubManager\PubManager\PubManager\Controllers\PageLayoutController.cs", line 598, in PubManager.Controllers.PageLayoutController+d__24.MoveNext()
  File "(unknown)", line 12, in System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
  File "(unknown)", line 40, in System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task task)
  File "(unknown)", line unknown, in System.Web.Mvc.Async.TaskAsyncActionDescriptor.EndExecute(System.IAsyncResult asyncResult)
  File "(unknown)", line unknown, in System.Web.Mvc.Async.AsyncControllerActionInvoker+<>c__DisplayClass37.b__36(System.IAsyncResult asyncResult)
  File "(unknown)", line unknown, in System.Web.Mvc.Async.AsyncControllerActionInvoker.EndInvokeActionMethod(System.IAsyncResult asyncResult)
  File "(unknown)", line 20, in System.Web.Mvc.Async.AsyncControllerActionInvoker+AsyncInvocationWithFilters.b__3d()
  File "(unknown)", line 134, in System.Web.Mvc.Async.AsyncControllerActionInvoker+AsyncInvocationWithFilters+<>c__DisplayClass46.b__3f()
  File "(unknown)", line 134, in System.Web.Mvc.Async.AsyncControllerActionInvoker+AsyncInvocationWithFilters+<>c__DisplayClass46.b__3f()
  File "(unknown)", line unknown, in System.Web.Mvc.Async.AsyncControllerActionInvoker.EndInvokeActionMethodWithFilters(System.IAsyncResult asyncResult)
  File "(unknown)", line unknown, in System.Web.Mvc.Async.AsyncControllerActionInvoker+<>c__DisplayClass21+<>c__DisplayClass2b.b__1c()
  File "(unknown)", line unknown, in System.Web.Mvc.Async.AsyncControllerActionInvoker+<>c__DisplayClass21.b__1e(System.IAsyncResult asyncResult)
ceTe.DynamicPDF.Imaging.ImageParsingException: TIFF Compression value 8 (Flate/Deflate) is not supported.

...

ANSWER

Answered 2021-Apr-26 at 15:14

It looks like TIFF images with compression value 7 or above are not supported. Have a look at this thread: https://www.dynamicpdf.com/forums/core-suite-for-net-v9/tiff-compression-value-7-extended-jpeg-not-supported

Source https://stackoverflow.com/questions/67268426

QUESTION

How do you skip over files with no extension when downloading them?

Asked 2020-Jul-17 at 19:58

My code is working correctly to scour a directory of PDFs, download weblinks embedded within those PDFs, and sequentially name them with appropriate file extension.

That being said - I am getting a few random files that download but DON'T have an extension associated with them. In doing quality checks, I have all the attachments that matter - these extra files are truly garbage.

Is there a way to not download them or build in a check in the code so that I don't end up with these phantom files?

...

ANSWER

Answered 2020-Jul-17 at 19:58

It's not clear which part of your code produces these "phantom" files, but anywhere you want to avoid downloading a file that doesn't have an extension, you can make the download conditional. If the component after the last slash doesn't contain a dot, do nothing.
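The check described above could be sketched as follows. The function names are hypothetical, and the sketch assumes the test should apply to the URL path only, so a dot inside a query string does not count as an extension.

```python
from urllib.parse import urlsplit


def has_extension(url):
    """Return True if the last path component of the URL contains a dot."""
    path = urlsplit(url).path  # drops the query string and fragment
    last_component = path.rsplit("/", 1)[-1]
    return "." in last_component


def filter_downloadable(urls):
    """Keep only URLs whose final path component has an extension."""
    return [u for u in urls if has_extension(u)]
```

Calling `filter_downloadable` on the list of extracted links before the download loop leaves the rest of the script unchanged.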

Source https://stackoverflow.com/questions/62957712

QUESTION

How do you correctly parse web links to avoid a 403 error when using Wget?

Asked 2020-Jul-17 at 15:50

I just started learning Python yesterday and have VERY minimal coding skill. I am trying to write a Python script that will process a folder of PDFs. Each PDF contains at least 1, and maybe as many as 15 or more, web links to supplemental documents. I think I'm off to a good start, but I'm having consistent "HTTP Error 403: Forbidden" errors when trying to use the wget function. I believe I'm just not parsing the web links correctly. I think the main issue is coming in because the web links are mostly "s3.amazonaws.com" links that are SUPER long.

For reference:

Link copied directly from PDF (works to download): https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG

Link as it appears after trying to parse it in my code (doesn't work, gives "unknown url type" when trying to download): https%3A//s3.amazonaws.com/os_uploads/2169504_DFA%2520train%2520pass.PNG%3FAWSAccessKeyId%3DAKIAIPCTK7BDMEW7SP4Q%26Expires%3D1909634500%26Signature%3DaQlQXVR8UuYLtkzjvcKJ5tiVrZQ%253D%26response-content-disposition%3Dattachment%253B%2520filename%252A%253Dutf-8%2527%2527DFA%252520train%252520pass.PNG

Additionally if people want to weigh in on how I'm doing this in a stupid way. Each PDF starts with a string of 6 digits, and once I download supplemental documents I want to auto save and name them as XXXXXX_attachY.* Where X is the identifying string of digits and Y just increases for each attachment. I haven't gotten my code to work enough to test that, but I'm fairly certain I don't have it correct either.

Help!
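The mangled link above starts with `https%3A//` and contains `%2520`, which is the pattern produced when an already percent-encoded URL is percent-encoded again. A small sketch with `urllib.parse` and a placeholder URL illustrates the effect; the question's own code is elided, so whether it calls `quote` is an assumption.

```python
from urllib.parse import quote, unquote

# Placeholder URL that is already percent-encoded (%20 stands for a space).
original = "https://example.com/os_uploads/DFA%20train%20pass.PNG?Expires=1909634500"

# Quoting it again escapes the ":" and every "%", so "%20" becomes "%2520".
mangled = quote(original)

# Undoing one layer of quoting recovers the original link.
restored = unquote(mangled)
```

If the link extracted from the PDF is already a valid URL, passing it to the downloader unchanged avoids the extra layer of encoding.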

...

ANSWER

Answered 2020-Jul-17 at 15:06

I prefer to use requests (https://requests.readthedocs.io/en/master/) when trying to grab text or files online. I tried it quickly with wget and I got the same error (might be linked to user-agent HTTP headers used by wget).

wget and HTTP header issues: download image from url using python urllib but receiving HTTP Error 403: Forbidden
HTTP headers: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent

The good thing with requests is that it lets you modify HTTP headers the way you want (https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers).
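The answer points at requests; the sketch below shows the same idea (sending a custom User-Agent header) with the standard library's `urllib.request` instead, so that it has no third-party dependency. The helper names, the default header value, and any URL passed in are placeholders, and the source does not confirm that a different User-Agent resolves the 403 for these particular links.

```python
import shutil
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Create a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, target_path, user_agent="Mozilla/5.0"):
    """Fetch url with the custom header and stream the response body to target_path."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        with open(target_path, "wb") as out:
            shutil.copyfileobj(response, out)
```

With requests, the equivalent is to pass a `headers` dictionary to `requests.get`, as described in the custom-headers section linked above.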

Source https://stackoverflow.com/questions/62955392

QUESTION

Formatting a table based on Criteria in R

Asked 2020-May-18 at 09:54

I have a dataset which is generated using this

...

ANSWER

Answered 2020-May-18 at 07:59

We can get the data in long format, remove rows with 0 values, count number of rows for each day and class and get data in wide format again.

Source https://stackoverflow.com/questions/61864274

QUESTION

Ghostscript pdf conversion makes ligatures unable to copy & paste

Asked 2020-Jan-07 at 19:27

I have a pdf (created with latex with \usepackage[a-2b]{pdfx}) where I am able to correctly copy & paste ligatures, i.e., "fi" gets pasted in my text editor as "fi". The pdf is quite large, so I'm trying to reduce its size with this ghostscript command:

...

ANSWER

Answered 2020-Jan-07 at 19:27

Without seeing your original file, so that I can see the way the text is encoded, it's not possible to be definitive. It certainly is not possible to have the pdfwrite device 'preserve the information'; for an explanation, see here.

If your original PDF file has a ToUnicode CMap, then the pdfwrite device should use that to generate a new ToUnicode CMap in the output file, maintaining cut&paste/search. If it doesn't, then the conversion process will destroy the encoding. You might be able to get an improvement in results by setting SubsetFonts to false, but it's just a guess without seeing an example.

My guess is that your original file doesn't have a ToUnicode CMap, which means that it's essentially only working by luck.

Source https://stackoverflow.com/questions/59632621

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install pdfx

Install pdfx with easy_install or pip (for example, 'pip install pdfx'), then run it.

Support

For new features, suggestions, and bugs, create an issue on GitHub. If you have questions, search or ask on the Stack Overflow community page.
