pdfx | Extract text, metadata and references (pdf, url, doi) | Document Editor library
kandi X-RAY | pdfx Summary
Extract references (pdf, url, doi, arxiv) and metadata from a PDF. Optionally download all referenced PDFs and check for broken links.
Top functions reviewed by kandi - BETA
- Create argument parser
- Check URLs
- Return the status code of a given URL
- Sanitize url
- Wait for all tasks to finish
- Color print a string
- Download PDFs to specified directory
- Download one or more URLs
- Return a list of all references
- Return the text output of a PDF
- Return the number of references in the file
- Return a dictionary containing the meta-information of the description
- Parses the value element
- Parse tag element
- Print text to the console
- Extract the arXiv
- Parse a requirements file
- Cleanup metadata for all keys
- Remove trailing whitespace
- Wait for completion
- Print an error code to stderr
Community Discussions
Trending Discussions on pdfx
QUESTION
I am getting my form data from frontend and reading it using fast api as shown below:
...ANSWER
Answered 2021-Nov-13 at 19:39
fastapi gives you a SpooledTemporaryFile. You may be able to use that file object directly if pdfx has an API that works on a File() object rather than a str representing a path (!). Otherwise, make a new temporary file on disk and work with that:
QUESTION
I'm aware that PDFX is a rule-based system designed to reconstruct the logical structure of scholarly articles in PDF form, regardless of their formatting style. The system's output is an XML document that describes the input article's logical structure in terms of title, sections, tables, references, etc.
I've been trying to convert some PDF files into XML using PDFX on python but http://pdfx.cs.man.ac.uk/ is not responding.
The code I use for the conversion is:
response = requests.post('http://pdfx.cs.man.ac.uk/', headers=headers, data=data)
Is it still available? Is there any other option to convert the documents reconstructing the structure of scholarly articles?
Thanks in advance!
...ANSWER
Answered 2021-Sep-15 at 15:21
From the research I've been doing these days, I found a similar tool called GROBID.
Home page: https://grobid.readthedocs.io/en/latest/
GitHub: https://github.com/kermitt2/grobid
It is machine learning software for extracting information from scholarly documents.
QUESTION
I have the following XML file:
...ANSWER
Answered 2021-May-30 at 16:16
The solution is a multi-step process: extract the database node, convert it to text, clean it up, and then convert it back to XML with the read_xml() function.
QUESTION
I'm getting this error from cete DynamicPdf, any help would be appreciated, I have posted it on their forum but not getting a response. It happens when I call the draw method to create a Pdf after merging several documents into one. Is there a workaround for it?
Stacktrace follows:
Traceback (most recent call last): File "(unknown)", line 235, in ceTe.DynamicPDF.Imaging.TiffImageData.a() File "(unknown)", line 19, in ceTe.DynamicPDF.Imaging.TiffImageData.c() File "(unknown)", line unknown, in ceTe.DynamicPDF.Imaging.TiffImageData.Draw(ceTe.DynamicPDF.IO.OperatorWriter writer, System.Single pdfX, System.Single pdfY, System.Single width, System.Single height) File "(unknown)", line 617, in ceTe.DynamicPDF.PageElements.Image.DrawRotated(ceTe.DynamicPDF.IO.PageWriter writer) File "(unknown)", line 103, in ceTe.DynamicPDF.PageElements.RotatingPageElement.Draw(ceTe.DynamicPDF.IO.PageWriter writer) File "(unknown)", line unknown, in ceTe.DynamicPDF.PageElements.Group.Draw(ceTe.DynamicPDF.IO.PageWriter writer) File "(unknown)", line 14, in ceTe.DynamicPDF.Page.b(ceTe.DynamicPDF.IO.PageWriter A_0) File "(unknown)", line 136, in ceTe.DynamicPDF.Page.fd(ceTe.DynamicPDF.IO.DocumentWriter A_0, System.Int32 A_1, System.Int32 A_2) File "(unknown)", line 178, in ceTe.DynamicPDF.Page.a(ceTe.DynamicPDF.IO.DocumentWriter A_0, System.Int32 A_1, System.Int32 A_2, System.Int32 A_3) File "(unknown)", line 166, in zz93.b1.f() File "(unknown)", line 419, in zz93.b1.Draw() File "(unknown)", line 1, in ceTe.DynamicPDF.Document.Draw(System.IO.Stream stream) File "(unknown)", line 15, in ceTe.DynamicPDF.Document.Draw() File "C:\Users\Simon\Source\Workspaces\PubManager\PubManager\PdfManager\Printing.cs", line 1505, in PubManager.PdfManager.Printing+d__3.MoveNext() File "(unknown)", line 12, in System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() File "(unknown)", line 40, in System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task task) File "C:\Users\Simon\Source\Workspaces\PubManager\PubManager\PubManager\Controllers\PageLayoutController.cs", line 598, in PubManager.Controllers.PageLayoutController+d__24.MoveNext() File "(unknown)", line 12, in System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() 
File "(unknown)", line 40, in System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task task) File "(unknown)", line unknown, in System.Web.Mvc.Async.TaskAsyncActionDescriptor.EndExecute(System.IAsyncResult asyncResult) File "(unknown)", line unknown, in System.Web.Mvc.Async.AsyncControllerActionInvoker+<>c__DisplayClass37.b__36(System.IAsyncResult asyncResult) File "(unknown)", line unknown, in System.Web.Mvc.Async.AsyncControllerActionInvoker.EndInvokeActionMethod(System.IAsyncResult asyncResult) File "(unknown)", line 20, in System.Web.Mvc.Async.AsyncControllerActionInvoker+AsyncInvocationWithFilters.b__3d() File "(unknown)", line 134, in System.Web.Mvc.Async.AsyncControllerActionInvoker+AsyncInvocationWithFilters+<>c__DisplayClass46.b__3f() File "(unknown)", line 134, in System.Web.Mvc.Async.AsyncControllerActionInvoker+AsyncInvocationWithFilters+<>c__DisplayClass46.b__3f() File "(unknown)", line unknown, in System.Web.Mvc.Async.AsyncControllerActionInvoker.EndInvokeActionMethodWithFilters(System.IAsyncResult asyncResult) File "(unknown)", line unknown, in System.Web.Mvc.Async.AsyncControllerActionInvoker+<>c__DisplayClass21+<>c__DisplayClass2b.b__1c() File "(unknown)", line unknown, in System.Web.Mvc.Async.AsyncControllerActionInvoker+<>c__DisplayClass21.b__1e(System.IAsyncResult asyncResult) ceTe.DynamicPDF.Imaging.ImageParsingException: TIFF Compression value 8 (Flate/Deflate) is not supported.
...ANSWER
Answered 2021-Apr-26 at 15:14
It looks like TIFF images with compression value 7 or above are not supported. Have a look at this thread: https://www.dynamicpdf.com/forums/core-suite-for-net-v9/tiff-compression-value-7-extended-jpeg-not-supported
QUESTION
My code is working correctly to scour a directory of PDFs, download weblinks embedded within those PDFs, and sequentially name them with appropriate file extension.
That being said - I am getting a few random files that download but DON'T have an extension associated with them. In doing quality checks, I have all the attachments that matter - these extra files are truly garbage.
Is there a way to not download them or build in a check in the code so that I don't end up with these phantom files?
...ANSWER
Answered 2020-Jul-17 at 19:58
It's not clear which part of your code produces these "phantom" files, but anywhere you want to avoid downloading a file that doesn't have an extension, you can make the download conditional. If the component after the last slash doesn't contain a dot, do nothing.
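The check described above could be sketched as follows. The function name `has_extension` and the example URLs are hypothetical; the sketch looks only at the URL path, so dots inside a query string do not count as an extension.

```python
import posixpath
from urllib.parse import urlsplit


def has_extension(url):
    """Return True if the component after the last slash contains a dot."""
    last_component = posixpath.basename(urlsplit(url).path)
    return "." in last_component


if __name__ == "__main__":
    # Hypothetical links, standing in for the ones extracted from the PDFs.
    urls = [
        "https://example.com/files/report.pdf",
        "https://example.com/files/report",
        "https://example.com/page?version=1.2",
    ]
    for url in urls:
        if has_extension(url):
            print("download:", url)
        else:
            print("skip:", url)
```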
QUESTION
I just started learning python yesterday and have VERY minimal coding skill. I am trying to write a python script that will process a folder of PDFs. Each PDF contains at least 1, and maybe as many as 15 or more, web links to supplemental documents. I think I'm off to a good start, but I'm having consistent "HTTP Error 403: Forbidden" errors when trying to use the wget function. I believe I'm just not parsing the web links correctly. I think the main issue is coming in because the web links are mostly "s3.amazonaws.com" links that are SUPER long.
For reference:
Link copied directly from PDF (works to download): https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG
Link as it appears after trying to parse it in my code (doesn't work, gives "unknown url type" when trying to download): https%3A//s3.amazonaws.com/os_uploads/2169504_DFA%2520train%2520pass.PNG%3FAWSAccessKeyId%3DAKIAIPCTK7BDMEW7SP4Q%26Expires%3D1909634500%26Signature%3DaQlQXVR8UuYLtkzjvcKJ5tiVrZQ%253D%26response-content-disposition%3Dattachment%253B%2520filename%252A%253Dutf-8%2527%2527DFA%252520train%252520pass.PNG
Additionally if people want to weigh in on how I'm doing this in a stupid way. Each PDF starts with a string of 6 digits, and once I download supplemental documents I want to auto save and name them as XXXXXX_attachY.* Where X is the identifying string of digits and Y just increases for each attachment. I haven't gotten my code to work enough to test that, but I'm fairly certain I don't have it correct either.
Help!
...ANSWER
Answered 2020-Jul-17 at 15:06
I prefer to use requests (https://requests.readthedocs.io/en/master/) when trying to grab text or files online. I tried it quickly with wget and got the same error (it might be linked to the User-Agent HTTP header used by wget).
- wget and HTTP header issues: download image from url using python urllib but receiving HTTP Error 403: Forbidden
- HTTP headers: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
The good thing with requests is that it lets you modify HTTP headers the way you want (https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers).
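A rough sketch of both points, under stated assumptions: the link in the question appears to have been percent-encoded once more after extraction (for example `https%3A//`), so it is decoded one level first, and a custom User-Agent header is then attached. To stay runnable without third-party packages, this sketch swaps in the standard library's `urllib.request.Request` instead of requests; the helper name `build_request`, the default User-Agent string, and the example URL are illustrative only.

```python
from urllib.parse import unquote
from urllib.request import Request


def build_request(extracted_url, user_agent="Mozilla/5.0"):
    """Undo one level of percent-encoding and attach a custom User-Agent."""
    url = unquote(extracted_url)  # "https%3A//host/a%2520b" -> "https://host/a%20b"
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"unexpected URL after decoding: {url!r}")
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    # Hypothetical over-encoded link, shaped like the one in the question.
    req = build_request("https%3A//example.com/a%2520b.png%3Fkey%3Dvalue")
    print(req.full_url)
    # The request object is only built here, not sent; passing it to
    # urllib.request.urlopen would perform the actual download.
```

With requests, the equivalent step would be to pass the decoded URL together with `headers={"User-Agent": ...}` to `requests.get`.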
QUESTION
I have a dataset which is generated using this
...ANSWER
Answered 2020-May-18 at 07:59
We can get the data in long format, remove rows with 0 values, count the number of rows for each day and class, and get the data in wide format again.
QUESTION
I have a pdf (created with latex with \usepackage[a-2b]{pdfx}) where I am able to correctly copy & paste ligatures, i.e., "fi" gets pasted in my text editor as "fi". The pdf is quite large, so I'm trying to reduce its size with this ghostscript command:
ANSWER
Answered 2020-Jan-07 at 19:27
Without seeing your original file, so that I can see the way the text is encoded, it's not possible to be definitive. It certainly is not possible to have the pdfwrite device 'preserve the information'; for an explanation, see here.
If your original PDF file has a ToUnicode CMap, then the pdfwrite device should use that to generate a new ToUnicode CMap in the output file, maintaining cut & paste/search. If it doesn't, then the conversion process will destroy the encoding. You might be able to get an improvement in results by setting SubsetFonts to false, but it's just a guess without seeing an example.
My guess is that your original file doesn't have a ToUnicode CMap, which means that it's essentially only working by luck.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported